COMPUTERIZED ADAPTIVE TESTING



FOREWORD



The plan for a conference devoted to the state of research in the field of computerized adaptive testing grew out of a
suggestion made in late 1974 by Frederic M. Lord of Educational Testing Service. As one of the principal psychometric
architects of the latent trait theory of mental abilities, which underlies the work being done in this field, Dr. Lord observed
that it was now time to bring together as many as possible of the people doing this work, for an overview of the state of the
art. It was then decided that the appropriate sponsors of such a conference were the Navy, whose Office of Naval Research
funds computerized adaptive testing projects in military and educational organizations, and the U.S. Civil Service
Commission, where psychologists in the Personnel Research and Development Center have been carrying out research in the
area for a number of years. Accordingly, representatives of these two offices met in March, 1975 to take the necessary steps
to organize the conference. Members of the organizing committee were: Glenn L. Bryan, Director, Office of Naval Research;
Marshall J. Farr, Director, Personnel and Training Research Programs, ONR; Joseph L. Young, Assistant Director, PTRP,
ONR; William A. Gorham, Director, Personnel Research and Development Center, U.S. Civil Service Commission; Richard H.
McKillip, Chief, Research Section, PRDC; Vern W. Urry, Frank L. Schmidt, and John F. Gugel, Personnel Research
Psychologists, PRDC.

The principal objectives of the conference were defined as exchange of information, discussion of theoretical and empirical
developments, and coordination of research effort. It was decided that the conference should be invitational, because of its
highly technical subject matter, and that invitations would be sent to those persons known to be interested in the subject.
Nominations were then made of researchers who would be asked to present papers and to act as discussants. From the list of
nominations, the committee selected those nominees it believed would represent the broadest range of effort from theory to
practical application and would also represent organizations in the public, private, and military sectors. Dr. Lord and Ben F.
Green, Jr. of Johns Hopkins University agreed to serve as discussants.
Edmund F. Fuchs was appointed conference coordinator to implement these decisions, and the conference was held as
planned on June 12 and 13, 1975, in Washington, D.C. Sixty-eight persons attended. Fourteen papers were read, and the
discussants, who had studied the papers in advance, commented upon them.

Informal discussion during and after the conference and replies to a short questionnaire given to the attendees indicated
that the objectives were successfully met. In general, attendees felt that a follow-up conference would be desirable, to pursue
further the potential of computers for the measurement of human abilities. Two announcements were made at the conference
sessions concerning ways of establishing a continuous exchange of information among researchers.
OPENING REMARKS



William A. Gorham, Director, Personnel Research
and Development Center, U.S. Civil Service Commission

I'd like to say first, a hearty welcome to you all. We are delighted to see this distinguished group assembled. To lead off
the first computerized adaptive testing conference gives me a feeling of being, in the phrase Dean Acheson used as the title of
his memoirs, "present at the creation." Some of you may remember that the quotation is from the words of Alphonso X, a
thirteenth century king of Spain, who said, "Had I been present at the creation I would have given some useful hints for the
better ordering of the universe." Well, I think that the principal value of this gathering is that we will have an opportunity to
give useful hints for the future of research in the field of computerized adaptive testing. Our immediate purpose is the
exchange of information, and of course this is of benefit to all concerned. But we hope the meeting will also result in the
creation of ways of achieving some other objectives that we consider important to the future of our research: continuing
exchange of information, identification of all people working on computerized testing, continued discussion of both
theoretical and empirical developments, and the coordination of research and development effort. I won't elaborate upon
these objectives right now; by the end of the conference we will all be in a better position to evaluate them and to devise
ways of accomplishing them. But I would like to say that it seems to me that our essential task is to achieve an orderly
progress of research that will avoid needless duplication of effort but that will at the same time allow the widest possible
range of effort, a system that will aid but not constrain the people who use it, and that will be our common responsibility.
The first step along the path to that achievement is the kind of exchange that will take place today and tomorrow.

I look forward to hearing contributions from all of you.
PAPERS PRESENTED



GRADED RESPONSE MODEL OF THE LATENT TRAIT THEORY
AND TAILORED TESTING

FUMIKO SAMEJIMA
University of Tennessee

INTRODUCTION

There is little doubt about the usefulness of the latent trait theory in tailored testing, or
computer-assisted adaptive individual testing. This is a pilot study for actual tailored testing,
using full and partial information given by a set of graded response items. The purpose of this
study is: 1) to find out how well tailored testing using mostly dichotomous items can provide us
with good estimates of ability compared with non-adaptive testing in which we use the full
information given by the graded item responses, and 2) to find out the possible branching effect
of a graded item when we use one as the initial item in tailored testing. The data used in this
study are: 1) the empirical results of paper-and-pencil tests, and 2) a hypothetical test with
response patterns calibrated by the Monte Carlo method. The data analyses were partly made in such
a way that we treat the data as if they were collected in actual tailored testing situations. For
this reason, we call it simulated tailored testing. Terminology will be used in the same way as in
Samejima's two Psychometrika Monographs (cf. Samejima, 1969 and 1972).

RATIONALE

The consistency of the maximum likelihood estimator when the likelihood function is given by the
product of identical probability density functions or probability functions has been proved by
Wald (Wald, 1949) and the proof has been shown in a simplified form by Kendall and Stuart (Kendall
and Stuart, 1961, Chapter 18). In the latent trait theory, this situation corresponds to the case
where all the items are equivalent, i.e., when the sets of operating characteristics of item
response categories are identical for all the items, either on the dichotomous or graded response
level. This, of course, is a fairly restricted case, and, in practice, we usually have to handle
sets of operating characteristics which are not identical.

The proof can easily be expanded to the case in which the probability density functions, or the
probability functions, are not identical, but observations increase in number following a
relatively mild restriction. Let $\xi_1, \xi_2, \ldots$ be a set of independent random variables
having identical distribution with the mean $\mu$. The strong law of large numbers, which is used
in the above proof, states that for any given positive numbers $\epsilon$ and $\delta$, there
exists an $N$ such that

$$\text{prob.}\left[\,\left|\frac{1}{n}\sum_{i=1}^{n}\xi_i - \mu\right| > \epsilon\,\right] < \delta \quad \text{for every } n > N. \tag{2-1}$$



Let us define two positive integers, $m$ and $r$, and consider $n$ such that

$$n = mr, \tag{2-2}$$

where $r$ is a fixed number, however large $n$ may be. Let
$\xi_{11}, \xi_{12}, \ldots, \xi_{1r}, \xi_{21}, \ldots, \xi_{2r}, \ldots$ be a set of
independent random variables, which are classified into disjoint subsets,
$A_1 = \{\xi_{11}, \xi_{12}, \ldots, \xi_{1r}\}$, $A_2 = \{\xi_{21}, \xi_{22}, \ldots, \xi_{2r}\}$, $\ldots$.
Let us assume that within a subset $A_j$ the $r$ random variables are not necessarily identically
distributed, but among the subsets we can always match, without overlapping, one random variable
from each subset $A_j$ ($j = 2, 3, \ldots$) to each element of $A_1$, so that it has a distribution
identical with that of the element of $A_1$, with a specified mean. Let $\mu_k$ ($k = 1, 2, \ldots, r$)
be the mean of $\xi_{jk}$.

If we define random variables $\zeta_j$ such that

$$\zeta_j = \sum_{k=1}^{r} \xi_{jk}, \tag{2-3}$$

then these random variables are independent and identically distributed, with the mean such that

$$E(\zeta_j) = \sum_{k=1}^{r} \mu_k. \tag{2-4}$$

Thus the strong law of large numbers is applicable for $\zeta_j$, if
not for $\xi_{jk}$. Using this mild restriction, we can write
$$\text{prob.}\left[\lim_{n \to \infty} \log L_V(\hat{\theta}) \le \log L_V(\theta)\right] = 1, \tag{2-5}$$

where $\hat{\theta}$ is the maximum likelihood estimator of the true parameter $\theta$, which
leads to the completion of the proof of the consistency of the maximum likelihood estimator. The
same restriction enables us to prove the ultimate uniqueness of the maximum likelihood estimator,
and the asymptotic efficiency and normality of the maximum likelihood estimator, with the
asymptotic variance

$$\left[-E\left\{\frac{\partial^2}{\partial\theta^2}\log L_V(\theta)\right\}\right]^{-1}. \tag{2-6}$$

We notice that (2-6) is the reciprocal of the test information function, $I(\theta)$. Thus if we
can reasonably assume that there are at most a finite number of non-identical sets of operating
characteristics and the number of items given to an examinee increases by repeating $r$ items
whose sets of operating characteristics are the same as these sets, but possibly arranged in
different orders, the maximum likelihood estimator ultimately distributes normally with the true
value $\theta$ as its mean and the reciprocal of the test information function as its variance.
For this reason, when $n$ is large, $I(\theta)$ can be considered as a good measure of accuracy of
estimation.

Let us consider the meaning of the information function when $n$ is relatively small. In an
extreme case where $n = 1$, the test information function $I(\theta)$ equals the item information
function $I_g(\theta)$. It has been shown that, as long as the model satisfies the unique maximum
condition, like the normal ogive or the logistic model, the item response information function
$I_{x_g}(\theta)$ is positive for the entire range of $\theta$, except, at most, at enumerable
points of $\theta$ (cf. Samejima, 1973). Under that condition, the basic function
$A_{x_g}(\theta)$ such that

$$A_{x_g}(\theta) = \frac{\partial}{\partial\theta} \log P_{x_g}(\theta) \tag{2-7}$$

is strictly decreasing in $\theta$, and the item response information function is given by

$$I_{x_g}(\theta) = -\frac{\partial}{\partial\theta} A_{x_g}(\theta). \tag{2-8}$$

Thus the item information function, which is given as the expectation of $I_{x_g}(\theta)$, such
that

$$I_g(\theta) = E\left[I_{x_g}(\theta)\right] = \sum_{x_g} I_{x_g}(\theta)\, P_{x_g}(\theta), \tag{2-9}$$

can be considered as the expected steepness of the basic function $A_{x_g}(\theta)$ for item $g$.
If we consider the response pattern information function, $I_V(\theta)$, such that

$$I_V(\theta) = -\frac{\partial^2}{\partial\theta^2} \log P_V(\theta) = \sum_{x_g \in V} I_{x_g}(\theta), \tag{2-10}$$

this is a measure of the steepness of the left hand side of the likelihood equation which is set
equal to zero. The item response information function $I_{x_g}(\theta)$, therefore, is the share,
or contribution, of each response $x_g$ to the response pattern $V$ of which $x_g$ is an element,
and the test information function $I(\theta)$, which can be written as

$$I(\theta) = E\left[I_V(\theta)\right] = \sum_V I_V(\theta)\, P_V(\theta), \tag{2-11}$$

where $\sum_V$ means the sum over all the possible response patterns, is the expected steepness of
the left hand side of the likelihood equation which is set equal to zero. Since we can interpret
the steepness of the left hand side of the likelihood equation as a measure of accuracy of
estimation, the test information function can be considered as a measure of accuracy of estimation
even if $n$ is relatively small. Following the same logic, the item information function
$I_g(\theta)$ can be considered as the expected contribution to the accuracy of estimation by
adding item $g$ to the test. For this reason, the item information function will be given an
important role in the selection of item-and-way-of-dichotomization in the present study of
behavior of maximum likelihood estimates in a simulated tailored testing situation.
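The information functions above can be illustrated numerically. The following is a minimal sketch, assuming the normal ogive model on the graded response level, in which the probability of obtaining the item score $x_g$ or greater is $\Phi[a_g(\theta - b_{x_g})]$. The function names and the parameter values are hypothetical and chosen only for illustration. The item information is computed as $\sum_{x_g} [P'_{x_g}(\theta)]^2 / P_{x_g}(\theta)$, which equals the expectation in (2-9) because the second derivatives of the category probabilities sum to zero.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def category_probs(theta, a, bs):
    """P_x(theta) for x = 0..m under the normal ogive graded response model.

    a is the discrimination index and bs the increasing difficulty indices
    b_1 < ... < b_m (hypothetical argument names).
    """
    star = [1.0] + [Phi(a * (theta - b)) for b in bs] + [0.0]
    return [star[x] - star[x + 1] for x in range(len(bs) + 1)]


def item_information(theta, a, bs):
    """Item information: sum over categories of P_x'(theta)^2 / P_x(theta)."""
    dens = [0.0] + [a * phi(a * (theta - b)) for b in bs] + [0.0]
    probs = category_probs(theta, a, bs)
    total = 0.0
    for x, p in enumerate(probs):
        dp = dens[x] - dens[x + 1]  # derivative of P_x with respect to theta
        if p > 0.0:
            total += dp * dp / p
    return total


def total_information(theta, items):
    """Test information as the sum of the item information functions."""
    return sum(item_information(theta, a, bs) for a, bs in items)


# Hypothetical item: graded scoring versus dichotomization at the second border.
graded = (1.3, [-0.6, 0.4, 0.8])
dichotomized = (1.3, [0.4])
print(item_information(0.0, *graded), item_information(0.0, *dichotomized))
```

Passing a single difficulty index gives the dichotomous case, so the same function can be used to compare a graded item with any of its dichotomizations; merging categories can only lose information, which is the comparison taken up in the paragraphs that follow.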

Suppose that we have collected testing data of $n$ items, each of which is scored into graded
categories, 0 through $m_g$ ($\ge 1$). It has been shown that the item information function
assumes much greater values for a graded item than a dichotomous item, and the problem of
attenuation paradox is ameliorated for a graded item (cf. Samejima, 1969, Chapter 6). Thus it is
obvious that, if we rescore each of the $n$ items dichotomously, choosing one of the category
borders for dichotomization, then the accuracy of estimation of $\theta$ will be lowered. A
question will be raised here: how much accuracy of estimation can we still maintain if we tailor a
set of $n$ optimal dichotomized items to an individual subject, instead of giving a set of $n$
uniformly dichotomized items to all subjects? To find this out, we can select an initial item out
of all the $n$ items more or less arbitrarily, and treat it as if it had been presented first. If
we convert the initial item to a dichotomous item by choosing one of the borders for
dichotomization, the examinees' item scores for that item, which range 0 through $m_g$, will be
converted to either 0 or 1, depending on the category border used. Following the normal ogive
model of the graded or dichotomous response level (cf. Samejima, 1969, Chapter 9; 1972), the first
estimate, $\hat{\theta}_1$, will be obtained. If the item score is 0, then $\hat{\theta}_1$ will
be $-\infty$; if it is $m_g$ on the graded response level or 1 on the dichotomous response level,
then $\hat{\theta}_1$ will be $+\infty$; and, otherwise, it will be a finite value. When
$\hat{\theta}_1$ is negative infinity, the next item and its way of dichotomization will be chosen
by searching the least value of $b_{x_g}$ among those of the remaining $(n - 1)$ items, and, when
$\hat{\theta}_1$ is positive infinity, the greatest $b_{x_g}$ is searched and used. When
$\hat{\theta}_1$ is a finite value, then the item and border which make the item information
function for the dichotomized item maximum at $\theta = \hat{\theta}_1$ is chosen and treated as
the second presentation. In this way, the second estimate, $\hat{\theta}_2$, will be obtained, and
the process will be repeated until we get the $n$th estimate, $\hat{\theta}_n$.
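The selection procedure just described can be sketched in code. This is a minimal illustration under stated assumptions, not the program used in the study: each item is dichotomized at border $x$ by scoring 1 when the graded score is $x$ or greater, the dichotomized item follows the normal ogive model with the difficulty index of that border, and a finite estimate is found by bisection on the likelihood equation within an assumed search range of $[-8, 8]$. All function names, item labels and parameter values are hypothetical.

```python
import math


def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    # erfc keeps the lower tail accurate for large negative z
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def dich_info(theta, a, b):
    """Item information of a dichotomized normal ogive item with border difficulty b."""
    z = a * (theta - b)
    return (a * phi(z)) ** 2 / (Phi(z) * Phi(-z))


def likelihood_slope(theta, responses):
    """Left hand side of the likelihood equation; responses are (a, b, u), u in {0, 1}."""
    slope = 0.0
    for a, b, u in responses:
        z = a * (theta - b)
        slope += a * phi(z) / Phi(z) if u == 1 else -a * phi(z) / Phi(-z)
    return slope


def mle(responses, lo=-8.0, hi=8.0):
    """Maximum likelihood estimate; infinite when all responses are alike."""
    scores = {u for _, _, u in responses}
    if scores == {0}:
        return -math.inf
    if scores == {1}:
        return math.inf
    for _ in range(60):  # bisection on the strictly decreasing slope
        mid = 0.5 * (lo + hi)
        if likelihood_slope(mid, responses) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def next_item(theta_hat, remaining):
    """Choose (item, border) among the remaining items by the three rules in the text."""
    candidates = [(g, x, a, b) for g, (a, bs) in remaining.items()
                  for x, b in enumerate(bs, start=1)]
    if theta_hat == -math.inf:
        g, x, _, _ = min(candidates, key=lambda c: c[3])
    elif theta_hat == math.inf:
        g, x, _, _ = max(candidates, key=lambda c: c[3])
    else:
        g, x, _, _ = max(candidates, key=lambda c: dich_info(theta_hat, c[2], c[3]))
    return g, x


def simulated_tailored_test(items, graded_scores, first_item, first_border):
    """Return the successive estimates, one after each simulated presentation."""
    remaining = dict(items)
    g, x = first_item, first_border
    responses, estimates = [], []
    while True:
        a, bs = remaining.pop(g)
        u = 1 if graded_scores[g] >= x else 0  # rescoring at border x
        responses.append((a, bs[x - 1], u))
        estimates.append(mle(responses))
        if not remaining:
            return estimates
        g, x = next_item(estimates[-1], remaining)


# Hypothetical three-item example.
items = {"g1": (1.0, [-1.0, 0.0, 1.0]),
         "g2": (1.2, [-0.5, 0.5, 1.5]),
         "g3": (0.8, [-1.5, -0.2, 0.9])}
scores = {"g1": 2, "g2": 1, "g3": 3}
print(simulated_tailored_test(items, scores, "g1", 2))
```

Because a whole item is removed from the pool after each presentation, the number of available item-and-border combinations shrinks by the number of borders of that item at every step, which is the limitation discussed next.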

This simulated tailored testing situation is different from the actual tailored testing
situation, in the sense that the selection is more limited in later presentations of items. In
the ordinary case, we start with a large set of dichotomous test items, and the number of items is
reduced by one after each tailored presentation. In the present simulated tailored testing
situation, however, the number of items is reduced by $m_g$ after the presentation of item $g$,
and at the last presentation selection is made only out of $m_h$ possibilities, where $h$ is the
remaining item. This will make the estimation more inefficient in later processes, and should be
kept in mind when observations are made for the results of the data analysis.



EMPIRICAL DATA AND THEIR ANALYSIS



A test of 18 items was constructed for research purposes, each of which is to be scored in a
graded way. It consists of two subtests, figural (FGR) and numerical (NMB), the former having ten
items and the latter having eight items. The initial instructions for each subtest, and also a
hypothetical NMB item, which was made for illustrative purposes, are shown in Appendix A.

The test was administered to 446 subjects, mostly college and summer school students in the
United States and Canada, in March through July, 1974, to get the complete data of 406 subjects.
In some sessions FGR was presented first, and in some others NMB was presented first. Each session
required approximately 90 minutes, including initial instructions and five minutes' break between
the two subtests. The number of subjects in each session varied from one to 36, but in many cases
it was less than ten. A time limit is set for each item, and is between 2 and 6 minutes, except
for the last NMB item for which it is 13 minutes. When there is one more minute left for each
item, the instructor calls, "One more minute to go." The full item score, $m_g$, is 3 for each of
the FGR items and also for each of the first seven NMB items, and it is 7 for the eighth NMB item.
For the FGR items, 1 is given for the completion of A and B, 2 for that of A through D, and 3 for
that of A through E (cf. Appendix A). For the first seven NMB items, the score is given in
accordance with the number of correct answers in each item, and for the last item the score is
given in a similar way as it is for a FGR item.

It turned out that the tenth item in FGR was too difficult for most subjects, and it was excluded
in the analysis of the data, to leave nine items for the subtest FGR. It also turned out that
frequencies for some item score categories were too small, so suitable recategorizations were made
to leave three item score categories for items 4, 6, 7 and 8 in FGR, two for item 9 in FGR, and
five for item 8 in NMB, making every frequency, at least, as large as 18. For the 17 item
variables, which are assumed behind the item scores, the multivariate normality was assumed, and
the polychoric correlation coefficient (cf. Tallis, 1962) was computed for each pair of the item
variables, using Lieberman's program (Lieberman, 1969). The principal factor solution was applied
for the resulting correlation matrix using the SPSS factor analysis program with iteratively
estimated communalities, to obtain eigenvalues, 5.859, 1.757, 0.902, 0.745, 0.578, etc., which
prove the existence of a strongly dominating first principal factor and a moderately dominating
second factor. Several different factor rotations were made, both orthogonal and oblique, for
these two factors, and the results uniformly showed the two clusters, one for each of the two
subsets of items, i.e., figural and numerical. Table 1 shows the results of both varimax and
quartimax rotations, along with the original factor loadings for the two principal factors. For
this reason, each subset of items, i.e., F1 through F9 for FGR or N1 through N8 for NMB, was
analyzed separately, and the first principal factor for the figural set of items, whose eigenvalue
turned out to be 3.029 or 60.2% of the total sum of communalities, was named the figural ability,
and the first principal factor for the numerical set, whose eigenvalue was 4.132 or 79.8% of the
total communalities, was named the numerical ability. The item parameters for the operating
characteristics, which follow the normal ogive model on the graded response level (cf. Samejima,
1969 & 1972), were calculated, using the formulas:



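$$a_g = \frac{\rho_g}{\left(1 - \rho_g^2\right)^{1/2}} \tag{3-1}$$

and

$$b_{x_g} = \frac{\gamma_{x_g}}{\rho_g}, \qquad x_g = 1, 2, \ldots, m_g; \tag{3-2}$$

where $\rho_g$ is the factor loading of item $g$ and $\gamma_{x_g}$ is the
normal deviate corresponding to the proportion of the
subjects who got the item score $x_g$ or greater. These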
parameter values are presented as Tables 2 and 3 for the
figural and the numerical abilities respectively.

TABLE 1

Factor Loading Matrices of the Seventeen Items for the First Two Common Factors for the Original
Principal Factors, After They Were Rotated Using Varimax and Quartimax Rotations.

          Before Rotation       Varimax Rotation      Quartimax Rotation
Item      First     Second      First     Second      First     Second
          Factor    Factor      Factor    Factor      Factor    Factor

F1         .485      .371        .106      .601        .611      .005
F2         .612      .455        .143      .749        .762      .017
F3         .577      .386        .163      .675        .692      .050
F4         .424      .154        .207      .400        .429      .139
F5         .432      .286        .125      .503        .516      .040
F6         .433      .321        .102      .529        .539      .013
F7         .358      .174        .146      .370        .389      .083
F8         .381      .274        .113      .440        .452      .039
F9         .502      .106        .298      .418        .461      .225
N1         .683     -.344        .736      .209        .326      .691
N2         .750     -.165        .664      .386        .490      .591
N3         .580     -.346        .662      .138        .245      .630
N4         .776     -.193        .702      .383        .493      .630
N5         .524     -.410        .663      .052        .160      .645
N6         .581     -.396        .696      .102        .215      .669
N7         .826     -.133        .698      .461        .570      .613
N8         .537      .086        .337      .426        .476      .262
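The conversion from a factor loading to normal ogive item parameters can be sketched as follows. This is a minimal illustration assuming the usual relations $a_g = \rho_g / (1 - \rho_g^2)^{1/2}$ and $b_{x_g} = \gamma_{x_g} / \rho_g$, and assuming that $\gamma_{x_g}$ is the point of the standard normal distribution above which lies the proportion of subjects who obtained the score $x_g$ or greater. The function name and the numerical values are hypothetical.

```python
import math
from statistics import NormalDist


def item_parameters(rho, prop_at_or_above):
    """Discrimination and difficulty indices from a factor loading.

    rho is the factor loading of the item; prop_at_or_above[x - 1] is the
    proportion of subjects with item score x or greater (x = 1..m).
    """
    a = rho / math.sqrt(1.0 - rho * rho)
    # normal deviate with the given proportion lying above it (assumed convention)
    gammas = [NormalDist().inv_cdf(1.0 - p) for p in prop_at_or_above]
    bs = [gamma / rho for gamma in gammas]
    return a, bs


# Hypothetical item: loading 0.6, with 90%, 50% and 20% of the subjects
# reaching the scores 1, 2 and 3 respectively.
a, bs = item_parameters(0.6, [0.9, 0.5, 0.2])
print(a, bs)
```

Under this convention an easy category border, one passed by most subjects, receives a negative difficulty index, and the difficulty indices of an item increase with the item score.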

Since there is no way of knowing each examinee's true ability score, the maximum likelihood
estimate, $\hat{\theta}$, was obtained from his response pattern of graded item scores, and was
treated as the best possible estimate of his true ability score. Also the test information
function, which is given by Equation 2-11, was calculated for each subtest, and it turned out that
the subtest NMB is far more informative than the subtest FGR. Figure 1 presents the test
information function of the subtest NMB. Taking the interval, [-0.1, 1.0], in which the values of
the test information function are no less than 7, we let the computer search the best possible way
of dichotomization of each item, to make the test information as large as possible for this
interval, and the resulting test information function is drawn by a dashed line in Figure 1. A
similar trial was made for the least informative way of dichotomization, and the resulting test
information function is shown by a dotted line in the same figure. Selecting all the subjects
whose $\hat{\theta}$ are located in the above interval, the maximum likelihood estimate was
calculated for each of these 138 subjects, using both the
most informative and the least informative ways of dichotomization. Figure 2 shows the sets of
these estimates plotted against $\hat{\theta}$. We can see a substantial difference between the
two scatter diagrams.

TABLE 2

Item Parameters For the Subtest FGR

Item    Discrimination    Difficulty Indices b_{x_g}
 g      Index a_g         x_g = 1     x_g = 2     x_g = 3

 1      0.S972            -1.0042     -03356      0.0833
 2      13196             -0.7468     -03532      -0^65
 3      1.0160            -1.2464     -03137      0.1476
 4      03775             -0.7984     0.1730
 5      0^940             -U081       0.7169      03554
 6      0.6558            -0.0337     3.1045
 7      0>*293            0^722       33345
 8      03644             -0.7988     23679
 9      03483             2jrn

TABLE 3

Item Parameters For the Subtest NMB

Item    Discrimination    Difficulty Indices b_{x_g}
 g      Index a_g         x_g = 1     x_g = 2     x_g = 3     x_g = 4

 1      1-1«758           -0.58387    0.02422     0^9302
 2      12193%            0.51100     1^1130      1.69291
 3      0.50123           -1.57011    -1.51105    -0^8W
 4      1^248             0.06765     0.32693     0.84445
 5      0^0989            -a99294     -0.15721
 6      0.93733           -0^8721     0.47768     1.71261
 7      I-58S94           0.02918     036303      0.72073
 8      0^3530            ai440i      0.52872     1.50170     2J89J23
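The search for the most and the least informative ways of dichotomization over an interval of ability can be sketched as follows. This is only an illustration: the criterion actually used by the program is not stated in the text, so the sketch assumes that each item's border is chosen to maximize (or minimize) the mean of the dichotomized item information over an evenly spaced grid on the interval. Since the test information is the sum of the item information functions, the borders can be chosen item by item. The function names and item parameters are hypothetical.

```python
import math


def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def dich_info(theta, a, b):
    """Item information of a dichotomized normal ogive item with border difficulty b."""
    z = a * (theta - b)
    return (a * phi(z)) ** 2 / (Phi(z) * Phi(-z))


def choose_borders(items, lo, hi, most_informative=True, steps=12):
    """Return {item: border} maximizing or minimizing mean information on [lo, hi]."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    pick = max if most_informative else min
    chosen = {}
    for g, (a, bs) in items.items():
        mean_info = [sum(dich_info(t, a, b) for t in grid) / steps for b in bs]
        chosen[g] = 1 + pick(range(len(bs)), key=mean_info.__getitem__)
    return chosen


# Hypothetical items on the interval [-0.1, 1.0].
items = {"g1": (1.0, [-1.0, 0.4, 4.0]), "g2": (1.5, [0.2, 2.5])}
print(choose_borders(items, -0.1, 1.0))
print(choose_borders(items, -0.1, 1.0, most_informative=False))
```

A border whose difficulty index lies inside the interval tends to be selected as most informative, and a border far outside it as least informative.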

A question will be raised here: what will the scatter diagram be if we tailor the way of
dichotomization for each individual subject? To answer this, a program was written to treat the
data as if these eight items had been presented in tailored testing selecting both item and way of
dichotomization, as was described at the end of the preceding section. Using the most informative
dichotomized item, N7 with the category border 2, the least informative dichotomized item, N3 with
the border 1, and a medium informative item, N1 with the border 2, the resulting scatter diagrams
are shown in Figure 3. We can see that in all these cases extremely scattered points are rare,
compared with Figure 2A, i.e., the case of the most informative dichotomization for the group of
these 138 subjects, to say nothing about the comparison with Figure 2B. This can be interpreted as
a benefit obtained by tailoring an individual test for each examinee.

Figure 1. Test information functions for the subtest NMB, when the graded scoring strategy is
taken, when the most informative dichotomous scoring strategy is taken for the interval
[-0.1, 1.0] (dashed line), and when the least informative dichotomous scoring strategy is taken
for the interval [-0.1, 1.0] (dotted line).

Figure 2. Maximum likelihood estimates obtained by dichotomizing NMB items for the interval
[-0.1, 1.0], plotted against $\hat{\theta}$, those obtained from the original response patterns of
graded item scores for the 138 subjects whose $\hat{\theta}$ are in the interval [-0.1, 1.0].
A: Using the most informative way of dichotomization. B: Using the least informative way of
dichotomization.

Figure 3. Maximum likelihood estimates obtained by simulated tailored testing plotted against
$\hat{\theta}$, those obtained from the original response patterns of graded item scores for the
138 subjects whose $\hat{\theta}$ are in the interval [-0.1, 1.0]. A: Using the most informative
dichotomized item, N7 with the category border 2, as the initial item. B: Using the least
informative dichotomized item, N3 with the category border 1, as the initial item. C: Using a
dichotomized item of medium information, N1 with the category border 2, as the initial item.

A second question will be raised here: is there any substantial gain if we use a graded test
item, instead of a dichotomous one, as the initial item in tailored testing? Since the number of
items is as small as eight, it will be of benefit if the use of a graded item gives a substantial
branching effect at the beginning of tailored testing. To find this out, using the most
informative and the second most informative graded items, N7 and N4, as the initial items
respectively, the same simulated tailored testing procedure was applied to obtain the maximum
likelihood estimate for each individual subject. The results are shown as Figure 4. To observe the
possible branching effect, in the first case the total 138 subjects were divided into two groups,
one consisting of the subjects whose graded scores for N7 are either 3 or 0, i.e., best or worst,
and the other consisting of those who obtained either 2 or 1, i.e., intermediate scores. We can
see an obvious branching effect by comparing Figures 4A and 4B.



Similar analysis was made for the other subtest, FGR, and the results are presented as Appendix
A. Since the maximum test information for FGR is a little more than 4 compared with that of NMB
which is almost 8, there is a general tendency that diagrams are more scattered, but, other than
that, similar tendencies as in NMB were observed. The interval of ability taken for these
observations was [-0^, 0.1], there are 123 subjects whose $\hat{\theta}$ are in this interval, and
the test information function for this interval is greater than 4, with an approximate maximum of
4.251. The initial items used for the simulated tailored testing are: F2 with the category border
2 (most informative), F6 with the category border 2 (least informative), F3 with the category
border 3 (medium), F2 (most informative graded) and F3 (second most informative graded).

Figure 5 presents two examples to illustrate how the maximum likelihood estimate converges in the
simulated tailored testing, for NMB, using the five different initial items which were described
in a previous paragraph. It may be suggested that the number of items, eight, is not sufficient
for all the cases. It should be recalled, however, that in the present study the selection of
item-and-way-of-dichotomization is more and more limited in later presentations of items. And yet
each dichotomized response pattern as a whole is a selection out of the 8,748 possibilities.

Figure 4. Maximum likelihood estimates obtained by simulated tailored testing plotted against
$\hat{\theta}$, those obtained from the original response patterns of graded item scores, for the
subjects whose $\hat{\theta}$ are in the interval [-0.1, 1.0]. A: Using the most informative
graded item, N7, as the initial item, for subjects whose item scores for N7 are extreme, either 0
or 3. B: Using the most informative graded item, N7, as the initial item, for subjects whose item
scores for N7 are intermediate, i.e., either 1 or 2. C: Using the second most informative graded
item, N4, as the initial item.

Figure 5. Two examples to show how the maximum likelihood estimates converge in the simulated
tailored testing. Initial items are: N7, most informative graded item; N4, second most informative
graded item; N7-2, most informative dichotomized item; N1-2, medium informative dichotomized item;
and N3-1, least informative dichotomized item.

MONTE CARLO DATA AND THEIR ANALYSIS

To make further observations in the present simulated tailored testing, a hypothetical test of 24
items was used.



The item parameters were given within the range of those of NMB, so that this hypothetical test
can be considered as an expansion of NMB in a rough sense of the word. Table 4 presents the item
parameters of these twenty-four hypothetical items, which have uniformly four item score
categories each. The test information function was obtained following the formula (2-11) and
presented as Table 5. As we can see from this table, this hypothetical test is most informative
around $\theta = -0.3$. For this reason, one hundred response patterns for these twenty-four test
items were calibrated by the Monte Carlo method on this level of ability, and were used as those
of one hundred hypothetical subjects. Figure 6 presents the cumulative frequency ratio of
$\hat{\theta}$ for these response patterns, in comparison with the normal distribution function
with $\mu = -0.3$ and $\sigma = 0.2128$, i.e., $1/\sqrt{22.081}$. We can see that these two curves
are close, and this indicates that the maximum likelihood estimate with these parameter values
already distributes almost normally for the 24 items. As before, the most informative and least
informative dichotomizations of the items were searched, and the resulting maximum likelihood
estimates were computed for each of these one hundred hypothetical subjects. Figures 7A and 7B
present the cumulative frequency ratios of these estimates together with the normal distribution
functions with $\mu = -0.3$ and the values of the standard deviation obtained by
$1/\sqrt{I(-0.3)}$, which turned out to be 0.2407 and 0.3685 respectively. Since in the present
situation the ability level is fixed at $-0.3$, the difference between the two standard
deviations, 0.2128 and 0.2407, should be interpreted as the minimized reduction caused by adopting
the dichotomous scoring strategy, and the one between 0.2407 and 0.3685 should be attributed to
the two different ways of dichotomization. It is also noticed that the discrepancies between the
normal curve and the cumulative frequency ratio are more conspicuous in these two dichotomized
cases compared with Figure 6.
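The Monte Carlo step can be sketched as follows. This is a minimal illustration, not the procedure actually programmed for the study: it assumes the normal ogive model on the graded response level, draws each graded score from a single uniform random number, and finds the maximum likelihood estimate by a ternary search that relies on the likelihood having a unique maximum and confines the estimate to an assumed range of $[-6, 6]$. The function names, the three items and the random seed are hypothetical; only the value 22.081 is taken from the text.

```python
import math
import random


def Phi(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def simulate_pattern(theta, items, rng):
    """Draw one graded response pattern at a fixed ability level theta."""
    pattern = []
    for a, bs in items:
        u = rng.random()
        # P(score >= x) = Phi(a (theta - b_x)), so count the borders passed
        pattern.append(sum(1 for b in bs if u < Phi(a * (theta - b))))
    return pattern


def log_likelihood(theta, items, pattern):
    total = 0.0
    for (a, bs), x in zip(items, pattern):
        star = [1.0] + [Phi(a * (theta - b)) for b in bs] + [0.0]
        total += math.log(max(star[x] - star[x + 1], 1e-300))
    return total


def mle(items, pattern, lo=-6.0, hi=6.0):
    """Ternary search for the maximum; estimates are confined to [lo, hi]."""
    for _ in range(100):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, items, pattern) < log_likelihood(m2, items, pattern):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def asymptotic_sd(information):
    """Standard deviation given by the reciprocal square root of the information."""
    return 1.0 / math.sqrt(information)


# Three hypothetical items; the study itself used twenty-four.
items = [(1.0, [-1.0, 0.0, 1.0]), (1.3, [-0.6, 0.4, 0.8]), (0.7, [-1.3, -0.2, 0.4])]
rng = random.Random(1975)
estimates = [mle(items, simulate_pattern(-0.3, items, rng)) for _ in range(100)]
print(round(asymptotic_sd(22.081), 4))  # the standard deviation quoted in the text
```

With a test long enough for the asymptotic theory to apply, the spread of such simulated estimates can be compared with the value returned by `asymptotic_sd`, which is the comparison shown in Figures 6 and 7.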

Figure 8 shows the same cumulative frequency ratios
compared with N(-0.3, 0.2128), for the maximum
likelihood estimates obtained by the simulated tailored testing
with the five different initial items: (23-2), the most
informative dichotomous; (3-3), the least informative
dichotomous; (14-3), a medium informative dichotomous;
(24), the most informative graded; and (23), the second
most informative graded, respectively. The mean square
errors for these five cases are 0.064, 0.068, 0.055, 0.056
and 0.058 respectively. If we take the square roots of these
values, they are 0.253, 0.260, 0.234, 0.236 and 0.240,
which are comparable to 0.2407, i.e., 1/√I(-0.3) for the
result of the most informative dichotomization case. This is
understandable because in that case the dichotomization
was, indeed, tailored for the level of θ = -0.3. To find out
about the branching effect of the initial graded items, four
more cases were added using four different dichotomized
initial items of various information levels, and the results
were arranged in Table 6 in the order of information levels



TABLE 4

Item Parameters of 24 Hypothetical Test Items

Item   Discrimination   Difficulty Indices
       Index

  1    0.50000          -0.70000      -0.50000       0.20000
  2    0.50000          -2.00000      -0.80000      -0.20000
  3    0.60000           0.30000       0.80000       2.10000
  4    0.60000           0.0           0.?0000       1.30000
  5    0.70000          -1.30000      -0.20000       0.40000
  6    0.70000           0.?0000       0.90000       2.00000
  7    0.80000          -0.50000       0.80000       1.90000
  8    0.80000          -1.10000      -0.90000      -0.10000
  9    0.90000          -0.20000       0.?0000       0.60000
 10    0.90000          -1.60000      -1.00000       0.20000
 11    1.00000          -1.?0000      -1.10000      -0.60000
 12    1.00000           0.10000       1.40000       1.60000
 13    1.10000          -0.10000       0.80000       1.10000
 14    1.10000          -1.00000      -0.50000       0.0
 15    1.20000          [illegible]   -0.20000       0.80000
 16    1.20000          -1.70000      [illegible]   -0.50000
 17    1.30000          -0.30000       0.50000      [illegible]
 18    1.30000          -0.60000       0.40000       0.80000
 19    1.40000          -0.90000       0.30000       1.10000
 20    1.40000          -0.?0000      -0.10000       0.60000
 21    1.50000          -1.90000      -1.60000      -1.20000
 22    1.50000          -1.50000      -0.40000       0.50000
 23    1.60000          -0.80000      -0.10000       0.80000
 24    1.60000          [illegible]   -0.60000       0.40000





of initial items. We can see from this table that, with the
exception of (14-3), the values of the mean square errors
are greater for the cases in which we used dichotomized
items as the initial item than those for the cases in which
graded items were used, although the differences are small.
To make a more detailed observation, two cases, in which
(24) and (14-3) were used as the initial item respectively,
were picked up, and these values were calculated for the
maximum likelihood estimates when 4, 6, 8, 12, 16, 20 and
24 items were used respectively in the simulated tailored
testing. The result is presented as Figure 9, in the form of
the comparison of the corresponding square roots of the
mean square errors. We can see that the branching effect is
conspicuous for the cases of fewer items, namely, 4, 6 and
8, and disappears with the addition of more items. This can
be interpreted to mean that when we add more items the effect
of the initial item becomes negligibly small. Note, however,
that in the present simulated tailored testing situation the
selection of item-and-way-of-dichotomization becomes
more and more limited in later presentation of items.



TABLE 5

Test Information Function of the Hypothetical Test of
24 Graded Items

Ability      Information
   θ         Function I(θ)

 -1.5        16.317
 -1.4        17.250
 -1.3        18.119
 -1.2        18.915
 -1.1        19.628
 -1.0        20.252
 -0.9        20.784
 -0.8        21.220
 -0.7        21.562
 -0.6        21.813
 -0.5        21.979
 -0.4        22.065
 -0.3        22.081
 -0.2        22.034
 -0.1        21.930
  0.0        21.776
  0.1        21.574
  0.2        21.326
  0.3        21.030
  0.4        20.681
  0.5        20.273
  0.6        19.800
  0.7        19.256
  0.8        18.636
  0.9        17.938
  1.0        17.164
  1.1        16.318
  1.2        15.?09
  1.3        14.??9
  1.4        13.452
  1.5        12.435




Figure 6. Cumulative frequency ratio of maximum likelihood
estimates obtained from the original response patterns of
graded item scores for the 100 hypothetical subjects
and the normal distribution function
with the parameters μ = -0.3 and σ = 0.2128.

Figure 7. Cumulative frequency ratio of maximum likelihood
estimates obtained from converted response patterns:
A. Using most informative dichotomization of items at
θ = -0.3, for the 100 hypothetical subjects and
the normal distribution with the parameters μ = -0.3 and
σ = 0.2407; B. Using least informative dichotomization
of items at θ = -0.3 for the 100 hypothetical
subjects and the normal distribution function
with the parameters μ = -0.3 and σ = 0.3685.

Figure 8. Cumulative frequency ratio of maximum likelihood estimates obtained by simulated tailored testing for the 100 hypothetical
subjects and the normal distribution with the parameters μ = -0.3 and σ = 0.2128: A. with the most informative
dichotomized item (23-2) as the initial item; B. with the least informative dichotomized item (3-3) as the initial item; C. with a
medium informative dichotomized item (14-3) as the initial item; D. with the most informative graded item (24) as the initial item;
E. with the second most informative graded item (23) as the initial item.



TABLE 6

Mean Square Errors and Other Indices for the Variability of the Maximum Likelihood Estimates in
the Simulated Tailored Testing Using Different Initial Items in NMB

Initial Item            I(-0.3)    Mean Square    Square Root of       Reciprocal of
                                   Error          Mean Square Error    Mean Square Error

Dichotomous   3-3       0.104      0.068          0.260                14.767
              5-1       0.260      0.069          0.263                [missing]
              10-3      0.479      0.060          0.245                16.723
              14-3      0.740      0.055          0.234                18.281
              18-?      1.018      0.066          0.258                15.051
              23-1      1.387      0.063          0.250                15.938
              23-2      1.615      0.064          0.253                15.580

Graded        23        2.074      0.058          0.240                17.332
              24        2.127      0.056          0.236                17.980

Figure 9. Comparison of the square roots of the mean square errors
of the maximum likelihood estimates in simulated tailored
testing with the graded item (24), plotted with x,
and the dichotomized item (14-3), plotted with o, as the
initial item, calculated for 4, 6, 8, 12, 16, 20, and 24
items.



DISCUSSION AND CONCLUSION 

Through the observations of two types of data, it has
been made clear that tailored testing, in which we use
dichotomous test items only, can provide us with much
more accurate estimation of ability than non-adaptive
testing, and that accuracy is almost comparable with that of
graded response level. We also have observed that the
branching effect by using a graded item as the initial item is
conspicuous when we use a relatively small number of
items. When the number of items increases in tailored
testing, however, the effect of the initial branching, or the
amount of information given by the initial item, seems to
have less effect on the final estimation. On this point, we
need a further study by using a larger number of items in
the original set of items, and also an item with more score
categories as the initial item.

APPENDIX A



1. INSTRUCTIONS FOR THE FIGURAL SUBTEST

There are 10 items in this part of the test. In each item, nine
figures are arranged in three rows and three columns, two of which
are missing as shown below. These figures are arranged according to
some rule, and you must find that rule by observing the seven
figures shown in the array.

[Figure: three-by-three array of figures with two missing, A and B]

Below this array, twelve figures are given, and you are to choose the
right figures for the missing ones in the above array, A and B.

Next, we add an additional column as shown above. You are to
choose the right figures for C and D out of the same twelve choices.

After you have followed the above two steps, then you are to
draw the right figure for E in the additional column. This figure may
or may not be one of the twelve choices.

Don't turn the page until you are
told to do so by the instructor.

2. INSTRUCTIONS FOR THE NUMERICAL SUBTEST

There are 8 items in this part of the test. In each item, a specific
rule is given, and you are to read the instruction carefully so that
you will understand and be able to handle the rule. They are
numerical items, and in all of them you must use calculations.

In each item, be sure that you understand the rule correctly. If
you have time, check the calculations, and be sure that the (positive
or negative) sign attached to your answer to each problem is a
correct one. Try to solve each problem correctly and as quickly as
possible.

Once you have started a calculation, continue the calculation
until you get the answer. Don't leave it unfinished and start another.

3. ITEM 1, NUMERICAL SUBTEST

The following square array of numbers is named E.

E =   1   1
      3   4

The first column of E, (1, 3), is called e₁, and its second column,
(1, 4), is called e₂.

Each number in a column is called an element. In the above
example, 1 and 3 are elements of the column e₁, and 1 and 4 are
elements of the column e₂.

The operator Π indicates that you should subtract from each
element of the column which comes next to the operator the
corresponding element of the column which follows, square the
resulting value, and then multiply all the results.

Example:

Consider the above example(s), and be sure that you understand
the operation.

Following this rule, compute each of the three numbers shown
on the next page for the square array A, which is given below.



A = 



3 
-4 

-6 



5 
9 
-1 



'2 
-7 
8 



(I) fla, 3j = 



(ii) naj a^ = 



(iii) na, a^ = 

If you have already finished the above, confirm that you have
used the operation correctly. Also check the calculations, and be
sure that the (positive or negative) sign attached to your answer to
each problem is correct.

Don't turn the page until you are
told to do so by the instructor.

Are there any questions?

Don't turn the page until you are
told to do so by the instructor.

APPENDIX B



Seven figures for the Subtest [illegible], corresponding to Figures 3
through 9 for the Subtest NMB. Initial items used for simulated
tailored testing are: 12-2 for Figure B3, [illegible]-2 for Figure B4, 13-3
for Figure B5, [illegible] for Figure B6, which corresponds to the
combination of Figures 7 and 8 for NMB, and [illegible] for Figure B9.
These are plotted for the 123 subjects whose θ lie in the
interval [illegible].

INCOMPLETE ORDERS AND COMPUTERIZED TESTING



A computerized adaptive testing system has three main
aspects, and consequently it can differ in three main ways
from a noncomputer system. First, there is the test item.
Full utilization of a computer allows an enormous
broadening in the type of problem that can be presented to the
individual. Typing out objective questions to him is the
most obvious thing to do, but it is far from the only thing,
and is perhaps far from the best thing. There is perhaps
even a greater extension of the possible types of examinee
response, as we can see not only from what is described
here but by borrowing from CAI techniques. Moreover, we
can easily incorporate speed of response into the scoring;
we can determine not only whether the person can give the
answer, but whether he can give it in ten seconds. But the
greatest difference between computerized adaptive testing
and ordinary testing is in the extent and nature of the
decision process that goes on between items.

It is with the latter aspect that I will be concerned here
today; the approach suggested here is quite different
conceptually than others such as the branching and the
Bayesian methods, so the paper will trace its origins. Tests
try to order persons, so we will first consider the basic
nature of orders and then how orders can be constructed
from incomplete data. Testing will be shown to be a type of
ordering process which utilizes incomplete data; computerized
adaptive testing develops orders from highly incomplete
data. I will give a simple example of how a
computer program based on these concepts works. Finally,
some of the ways in which these concepts form the basis
for a test theory will be suggested.

Our approach to a model for computerized testing has
its origins in quite a different area, computer-interactive
judgment methods. In order to demonstrate the relation
between testing and ordering, let us consider for a moment
a simple order. A simple order is defined, and please let me
use quite informal language, as a set whose members display
a relation between elements which demonstrates asymmetry
and transitivity. Now what that means is that, if we
have a matrix which records the existence of the relation as
a 1, or its non-existence as a 0, between a pair of elements
of the set, the matrix must display the triangular form
shown in the first figure. Paired comparisons judgments of
some stimulus property of course often display a close
approximation to this form. For example, suppose we used
the five indicated letters, presented them in pairs, and asked
a child which came first in the alphabet. Then we record his
judgment as a 1 if he responds that the row letter comes
before the column letter and a 0 if he says the reverse. If he
knew the order of the alphabet, then the data would be as
shown.

      v   w   x   y   z

  v   -   1   1   1   1

  w   0   -   1   1   1

  x   0   0   -   1   1

  y   0   0   0   -   1

  z   0   0   0   0   -

Fig. 1. Paired comparisons matrix for a simple order, showing
transitivity and asymmetry.
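
The two defining properties can be checked mechanically on such a matrix. What follows is a minimal sketch in Python, not the program used in the research; the function names are illustrative, and diagonal entries are stored as 0.

```python
def is_asymmetric(m):
    """No pair (i, j) may have both i-over-j and j-over-i recorded."""
    n = len(m)
    return all(not (m[i][j] and m[j][i]) for i in range(n) for j in range(n))


def is_transitive(m):
    """Whenever i dominates j and j dominates k, i must dominate k."""
    n = len(m)
    return all(
        m[i][k]
        for i in range(n)
        for j in range(n)
        for k in range(n)
        if m[i][j] and m[j][k]
    )


letters = "vwxyz"

# Judgments of a child who knows the alphabet: 1 if the row letter
# comes before the column letter, 0 otherwise.
judgments = [[1 if row < col else 0 for col in letters] for row in letters]

# One inconsistent judgment: the child says z comes before v.
inconsistent = [row[:] for row in judgments]
inconsistent[0][4] = 0
inconsistent[4][0] = 1

simple_order = is_asymmetric(judgments) and is_transitive(judgments)
still_simple = is_asymmetric(inconsistent) and is_transitive(inconsistent)
```

Here `simple_order` is true, while the single reversed judgment leaves the matrix asymmetric but breaks transitivity, so `still_simple` is false.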

An interesting property of such paired comparisons
matrices is that they need not be complete. Suppose we do
not ask about all pairs, but do assume that the data is
asymmetric and transitive. Then we may be able to
complete the matrix by performing matrix algebra on the
elements which we do have. This is illustrated in the second
set of figures. The lefthand one shows an incomplete
dominance matrix, one which incidentally would typically
be found by the kind of interactive ordering program we
developed, and the right one shows that matrix multiplied
by itself. We see that in this instance the square of the
obtained matrix shows exactly the same triangular form as
the complete matrix in Fig. 1. Actually, the data matrix
could be even more incomplete than this one and still yield
a complete order. The necessary part of the matrix is the
supradiagonal chain of ones which corresponds to the
judgments concerning the letters which are next to each
other in the alphabet. As long as we have these, then the
matrix can be completed; we just have to raise it to a high
enough power. Of course, when dealing with human
judgments with their inconsistency, we have to build in
some safeguards and redundancy in the process.
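
Completion by powering can be sketched as follows. This is an illustrative Python sketch, assuming Boolean arithmetic (any positive product counts as 1) and repeated squaring until nothing changes; the function names are not from the original program.

```python
def bool_mult(a, b):
    """Boolean matrix product: entry is 1 if some intermediate element links i to j."""
    n = len(a)
    return [
        [int(any(a[i][k] and b[k][j] for k in range(n))) for j in range(n)]
        for i in range(n)
    ]


def bool_or(a, b):
    """Elementwise Boolean sum of two matrices."""
    return [[int(x or y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def complete_by_powering(a):
    """Add implied relations (A or A*A) repeatedly until no new ones appear."""
    current = [row[:] for row in a]
    while True:
        nxt = bool_or(current, bool_mult(current, current))
        if nxt == current:
            return current
        current = nxt


# Only the supradiagonal chain is observed: v<w, w<x, x<y, y<z.
n = 5
chain = [[1 if j == i + 1 else 0 for j in range(n)] for i in range(n)]

completed = complete_by_powering(chain)
full_triangle = [[1 if j > i else 0 for j in range(n)] for i in range(n)]
```

Starting from the four adjacent judgments alone, `completed` equals `full_triangle`, the complete order of Fig. 1.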

Fig. 2. Sufficient adjacency matrix A₁, its square A₁², and the sum
A₁ + A₁², showing that the latter has the same qualitative form as A.

The reason for going through that exercise is that the
model we propose for computerized testing is exactly the
same. We say that tests order people. In what sense is that
true? In what sense is the relation between people one which
is asymmetric and transitive? It is superficially obvious that
if examinees are given different scores, then the relation
between the scores is asymmetric and transitive. That is just
a property of numbers, in fact the one which served as a
model for ordering in the first place. But it is a property
which is just as true of the testees' zip codes, or their social
security numbers, or their football jersey numbers, as it is
of their test scores. What is it about test scores that makes
the order empirically meaningful rather than arbitrary?

* Preparation of this paper was supported in part by the Office of
Naval Research, Contract No. 150-373.

Test scores start out from binary relations between
people and items. How is it that we are allowed to derive
from such relations numbers which give us an order of
people, in the same sense that we can assign numbers to
stimuli that give their order? Where is the asymmetric,
transitive relation?

A long time ago, Louis Guttman gave part of the answer
(Guttman, 1941). He said that items order persons if the
score matrix displays the form we have come to call the
Guttman scale, but should more fairly call the
Guttman-Loevinger scale since she invented an almost identical
concept and developed it in a superior way (Loevinger,
1947). But Guttman's answer is not completely satisfactory
to the formalist. The score matrix is rectangular, not
square; item responses are defined as right or wrong by fiat
and have no chance to be other than asymmetric. The
transitivity of a Guttman scale is indirect.

The most important part of the answer to the questions
concerning the legitimacy of items as orderers of persons
lies in the realization that the score matrix is only part of a
larger matrix of relations. The relations matrix is really
items-plus-persons by items-plus-persons, not just items by
persons. We think of the response of a person to an item as
indicating a dominance relation between the person and
the item. Habitually, we put a one in the score matrix if the
person gets the item right and a zero if he gets it wrong. But
that is because, being people, we identify with the persons
dimension of the matrix. If instead we were items, in some
through-the-looking-glass world, we would use the opposite
notation, giving the item a one if the person got it wrong
and a zero if the dumb thing allowed itself to be gotten
right by the person.



Taking the point of view of neither items nor persons
but rather of test theorists, we must take a less chauvinistic
stance and play fair in our scorekeeping. The score matrix is
expanded. In the expanded matrix, we give a one to the
winner of the contest between item and person and a zero
to the loser, regardless of which is which. Such a matrix is
given at the left of Figure 3. In the lower left corner of the
matrix we have the usual binary score matrix which shows
which items were defeated by which persons. The matrix
here is of the Guttman form. In the upper right we have
the same matrix from the item point of view, giving a one
each time an item defeats a person. Since the score matrix
is complete here, the upper right matrix is the transposed
complement of the lower left one.

There are two other sections of this expanded score
matrix and these are left blank. These sections correspond
to the item-item and person-person relations, which are not
observed directly. In the case of pairwise judgments, we
found above that an incomplete matrix could be completed
by squaring the observed matrix. Let us do that in the
present case. The result is shown in the right side of the
figure. It is two triangular matrices, one for items and one
for persons. Thus, treated in this formal fashion, we see
that a GL scale does give two asymmetric, transitive
relations, one for items and one for persons. We will return
to these two order matrices in another context.

We can put the two orders together. This is illustrated in
Figure 4; the matrix on the left is simply the sum of the
two matrices from Figure 3, that is, S + S². The matrix on
the right of Figure 4 contains exactly the same elements,
but they have been rearranged, that is, pre- and
postmultiplied by a permutation matrix P, into the order which is
implied here, a joint order of persons and items, which is
seen to in fact be a simple order because of the triangular,
i.e., asymmetric and transitive, form of the matrix. This
answers those querulous questions about where the order is
in the case of test data. If the data are a Guttman scale,
then the score matrix, expanded and operated on in the
manner indicated, does indeed define an order in the rather
strict sense of the existence of a relation on a set, a relation
which is transitive and asymmetric.
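
The expansion, squaring, and reordering steps can be traced on a small case. The sketch below is illustrative Python under stated assumptions: complete, errorless Guttman data without ties, two items a, b and three persons 1, 2, 3 with made-up responses, elements listed items first, and the joint order obtained by sorting on the number of elements each one dominates. None of the names come from the original program.

```python
def expand(score):
    """Build the (items + persons) square matrix S from a persons-by-items score matrix.

    A right answer gives the person a 1 over the item (lower left block);
    a wrong answer gives the item a 1 over the person (upper right block).
    Item-item and person-person blocks are left at 0 (unobserved).
    """
    n_persons, n_items = len(score), len(score[0])
    size = n_items + n_persons
    s = [[0] * size for _ in range(size)]
    for p, row in enumerate(score):
        for i, right in enumerate(row):
            if right:
                s[n_items + p][i] = 1
            else:
                s[i][n_items + p] = 1
    return s


def matmul(a, b):
    """Ordinary (arithmetic) matrix product."""
    size = len(a)
    return [
        [sum(a[r][m] * b[m][c] for m in range(size)) for c in range(size)]
        for r in range(size)
    ]


def joint_order(score, labels):
    """Return labels in joint order and the reordered 0/1 form of S + S*S."""
    s = expand(score)
    s2 = matmul(s, s)
    size = len(s)
    dominance = [
        [1 if s[r][c] + s2[r][c] > 0 else 0 for c in range(size)]
        for r in range(size)
    ]
    order = sorted(range(size), key=lambda r: sum(dominance[r]), reverse=True)
    reordered = [[dominance[r][c] for c in order] for r in order]
    return [labels[r] for r in order], reordered


# Hypothetical scalable data: person 1 misses both items,
# person 2 passes only a, person 3 passes both.
score = [[0, 0], [1, 0], [1, 1]]
labels = ["a", "b", "1", "2", "3"]

names, matrix = joint_order(score, labels)
```

For these data `names` comes out as 3, b, 2, a, 1, and `matrix` has ones everywhere above the diagonal and zeros elsewhere, the triangular form of a simple order.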

Let me say that for illustrative purposes here the matrix
operations have been carried out in ordinary arithmetic.
Because the relations are binary, we
should have been doing the matrix multiplication with
Boolean arithmetic. The only thing that means in the
present context is that all numbers greater than one in the
matrices should be set equal to one.

Fig. 3. Complete (showing rights and wrongs) score matrix S for two
items a, b and three persons 1, 2, 3 for scalable data, and S²
showing item-item and person-person dominance.

Fig. 4. S + S² in its original segregated form (left) and reordered
form (right), the latter showing qualitative asymmetry and
transitivity like a simple order.

So far, we have not referred directly to anything having
to do with "computerized adaptive testing," but the
relevance of the above theoretical sketch is quite direct.
Just as the score matrix itself is a kind of incomplete
matrix of dominance relations that can be completed by
the powering operation, an even more incomplete set of
relations is all that is really necessary to define the joint
person-item order. If we happen to ask each person only
the hardest item he can answer correctly and the easiest
item he would miss, those 2n relations (actually, 2n - 2 is
enough) are sufficient to define the complete joint order of
items and persons. This subset of relations can quite simply
be shown to correspond to the relations between adjacent
elements in the order, the supradiagonal string of ones we
saw in the incomplete paired comparisons matrix of Fig. 2.
In fact, if you look at the righthand matrix of Figure 4, the
string of ones just above the diagonal there denotes exactly
this set of item-person relations. In the 1975 Bulletin article
(Cliff, 1975) I illustrated the way in which such a set of
relations could be used to reconstruct the complete score
matrix. That process is reproduced here in Figure 5, where
the matrix powering is carried out.

Fig. 5. Illustration of completion by powering. Starred entries are derived by implication.
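
The sufficiency of the adjacent relations can be illustrated with a small simulation. The following Python sketch makes several assumptions that are not in the original: hypothetical numeric abilities and difficulties are used only to generate errorless responses, persons and items alternate in the joint order, and the helper names are invented for the illustration.

```python
def closure(rel):
    """Boolean powering: add implied relations until the matrix stops changing."""
    size = len(rel)
    while True:
        nxt = [
            [
                1 if rel[r][c] or any(rel[r][m] and rel[m][c] for m in range(size)) else 0
                for c in range(size)
            ]
            for r in range(size)
        ]
        if nxt == rel:
            return rel
        rel = nxt


# Hypothetical values; a person passes an item when ability exceeds difficulty.
abilities = {"1": 0, "2": 2, "3": 4}
difficulties = {"a": 1, "b": 3}
labels = list(difficulties) + list(abilities)
index = {name: pos for pos, name in enumerate(labels)}

observed = [[0] * len(labels) for _ in labels]
n_observed = 0
for person, ability in abilities.items():
    passed = [i for i, d in difficulties.items() if d < ability]
    failed = [i for i, d in difficulties.items() if d > ability]
    if passed:  # hardest item the person answers correctly
        hardest = max(passed, key=difficulties.get)
        observed[index[person]][index[hardest]] = 1
        n_observed += 1
    if failed:  # easiest item the person misses
        easiest = min(failed, key=difficulties.get)
        observed[index[easiest]][index[person]] = 1
        n_observed += 1

completed = closure(observed)

# Every person-item response should now be implied, observed or not.
recovered = all(
    completed[index[p]][index[i]] == (1 if abilities[p] > difficulties[i] else 0)
    and completed[index[i]][index[p]] == (1 if abilities[p] < difficulties[i] else 0)
    for p in abilities
    for i in difficulties
)
```

With three persons, only 2n - 2 = 4 relations are observed, yet after powering every one of the six person-item responses is recovered.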

Unfortunately, there is a problem: we do not know the
right items to ask a person until after we have asked them.
The routine by which the computer searches for the right
items to ask is one of the two main aspects of the
processing part of computerized adaptive testing, the other
main aspect being how it damps out error. In our research,
what we are doing is carrying over some principles which
we have previously found to be effective in the paired
comparisons ordering case.

The next set of figures illustrates the operation of a
prototype program of the kind we have in mind, written by
Jerry Kehoe. First, the program asks each person two items
at random. The entries in the lefthand matrix of Figure 6
show the results of these preliminary rounds and the
righthand one shows the powered matrix which contains
the implications of these responses as well as the responses
themselves. So far these are very few. The computer then
decides which items to ask which persons next by seeing
which are closest together in the order so far determined.
This process of presentation, powering, and selection would
go on for several rounds. The next figure shows the score
matrix for an intermediate round on the left and the
implications on the right. Now the powering process is
having some effect. The next one shows the final score
matrix on the left and the implications on the right, where
we see that not only has the score matrix been completed
by implication but there are now complete, simple orders of
persons and items.

Incidentally, we do not have a name for this method. We
would like to call it the Extended Transitivity System, or
ETS, but those initials have been preempted.

You can see that the savings are not very great in this
instance; each person must be asked most of the items. This
impression is primarily a function of the size of the data
matrix here. The savings are much, much greater with large
matrices. An upper bound for the number of item-person
relations that must be observed for n persons and k items is
log₂[(n + k)!]. For 200 persons and 200 items this number is
about 2886. That means we would need to ask each person
only 15 items to get the complete order; moreover, this
upper bound is quite a generous one in the present instance;
a couple fewer might well be sufficient.
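
The figure quoted for 200 persons and 200 items can be reproduced directly. This is a small Python sketch; the function name is illustrative, and `math.lgamma` is used so that the factorial itself never has to be formed.

```python
import math


def order_bound(n_persons: int, n_items: int) -> float:
    """log2((n + k)!), computed as ln Gamma(n + k + 1) / ln 2."""
    return math.lgamma(n_persons + n_items + 1) / math.log(2)


bound = order_bound(200, 200)          # about 2886 relations in all
per_person = math.ceil(bound / 200)    # about 15 items per person
```

Dividing the total by the 200 persons gives a little over 14 relations each, which rounds up to the 15 items mentioned in the text.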

Thus the method will work if the responses form a
Guttman scale. It works surprisingly quickly and requires
surprisingly little space in the computer, primarily because
the programs take advantage of the binary nature of the
data to store responses as single bits and then to carry out
many of the calculations on whole words; that is, 32
elements at a time are processed in raising the matrix to the
next power.
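
The idea of storing relations as single bits and operating on whole words can be sketched as follows. This Python illustration is not the original program: each row of the dominance matrix is held in one integer (Python integers are unbounded, so one integer stands in for as many 32-bit words as a row would need), and the names are invented for the example.

```python
def pack(matrix):
    """Store each 0/1 row as an integer whose bit j is set when the row element dominates j."""
    return [sum(1 << j for j, v in enumerate(row) if v) for row in matrix]


def bool_square_packed(rows):
    """Boolean square: row i becomes the OR of the rows of everything i dominates."""
    out = []
    for row in rows:
        acc = 0
        j = 0
        bits = row
        while bits:
            if bits & 1:
                acc |= rows[j]   # one OR handles a whole row of relations at once
            bits >>= 1
            j += 1
        out.append(acc)
    return out


def closure_packed(rows):
    """Repeat (rows OR rows squared) until no new relations appear."""
    while True:
        squared = bool_square_packed(rows)
        nxt = [r | s for r, s in zip(rows, squared)]
        if nxt == rows:
            return rows
        rows = nxt


# Supradiagonal chain on five elements, as in the paired comparisons example.
chain = [[1 if j == i + 1 else 0 for j in range(5)] for i in range(5)]
packed = pack(chain)
completed = closure_packed(packed)
```

Each bitwise OR combines an entire row of relations in one operation, which is the source of the speed and storage savings described above.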

It is really no surprise that it works with errorless data.
The crucial questions are how well will it work with the
kind of inconsistent items and persons that the real world
faces us with, and what advantages does it offer over other
[...] the opportunity to test it, first with artificial stochastic
data and then with real data. How well it will do in practice
relative to the other approaches that have been reported
and which we are hearing about during these two days must
await even further data.

Fig. 6. (Left) Initial item responses matrix S, showing both person dominances and item dominances. Blank entries indicate item-person pairs
not yet observed. (Right) S + S², showing the implied item-item and person-person dominances.

A priori, the methodology here appears to offer at least
one potential advantage, the avoidance of extensive
pretesting to determine item characteristics. Such pretesting
presented problems, even to paper and pencil testing. There
was the security problem, the question of comparability of
populations, the differing contexts, the expense itself. In
the computerized situation, these all become more acute.
The present process avoids pretesting since items and
persons are processed in parallel.

This method does require a substantial number of
persons being tested simultaneously, however, but this is
only initially true. Once a substantial set of person-item
relations has been built up, additional persons can be
processed individually as they appear, being fit into the
previously determined order by means of their responses to
the items. Under that mode of operation the amount of
additional computer processing would be quite small.

It also seems to me that this way of thinking about
tailored testing makes it easier to think of testing as
integrated into a total personnel process. After all, it could
be that the item selected for a person at a given point could
be something like, "You have been assigned to welders'
school. Come back when you have completed the course."
The "item" in that case is successful completion of the
course.

But to me, the most promising aspect of this method is
theoretical. It furnishes the basis for a test theory which I
think is more appropriate to the computerized testing
context. If what is wanted from testing is an order of
persons, and norms after all just tell the individuals'
positions relative to some benchmark persons, then surely
we want the order to be consistent and complete. How do
you tell if the order is consistent and complete? You look
at the person-person relation matrix and see if it is
asymmetric and transitive. It is easy to think of indices
which would reflect the degree to which that matrix has
those properties. Indeed, I had intended to spend my time
here today talking about them, but the results of our study
are not quite ready for presentation yet. Such indices
furnish analogues of the familiar Kuder-Richardson
formulas which are central to basic test theory, and in fact are
related to them in the case of complete data. They have the
additional property of being readily generalizable to the
incomplete or computer adaptive case. Thus if we go about
computerized testing in the way described here, we can at
least have appropriate evaluational indices built into the
system. Other tailored testing schemes rely on external
information from traditional modes of testing to get their
biserial correlations, item difficulties, reliabilities, and so
on. Here, analogues of these indices will come out of the
interactive process itself.




Cliff, N. Complete orders from incomplete data: Interactive
ordering and tailored testing. Psychological Bulletin, 1975, 82.

Guttman, L. The quantification of a class of attributes: A theory
and method of scale construction. In P. Horst (Ed.), The
prediction of personal adjustment. New York: Social Science
Research Council, 1941.

Loevinger, J. A systematic approach to the construction and
evaluation of tests of ability. Psychological Monographs, 1947,
61(4, Whole No. 285).

ADAPTIVE TESTING RESEARCH AT MINNESOTA:
OVERVIEW, RECENT RESULTS AND FUTURE DIRECTIONS *



Adaptive Testing and Error Reduction

The general objective of our research program on
adaptive testing is to view it from a perspective which
identifies several sources of potential error in test scores,
and to study adaptive testing as a means for reducing these
errors of measurement.

The first general source of error that we have been
concerned with for some time is the error that results from
the mismatch of item difficulties in an ability test with the
individual's ability. Obviously, the testee's ability is not
known at the start of testing. But the different strategies of
adaptive testing that have been proposed can be viewed as
different ways of matching item difficulties with testee
ability and sequentially estimating the testee's ability.
Consequently, one of our major focuses is to determine the
best, or at least better, ways of adapting item difficulties to
individual abilities.

We are approaching this in two complementary ways.
First, we have been doing live computerized testing. Since
late 1972 we have tested more than 5,000 subjects on a
variety of strategies of adaptive testing. But live testing
cannot provide the answer to all the questions concerning
which strategies are best under which conditions, because
there are too many questions to be answered. Therefore, we
are using computer simulation to supplement and extend
the results that we obtain from live testing.

Our general strategy is to implement an adaptive testing
strategy in live testing to obtain some data with an
arbitrarily structured live adaptive test, such data as
characteristics of score distributions and test-retest
reliabilities. Then, our ultimate goal is to build a computer
simulation model which will accurately reflect the results
that we obtain from live testing. With the computer
simulation model we can then very rapidly study different
variations of the adaptive testing strategy. The next step is
to verify the simulation results in live testing.

Thus far we have not yet developed a simulation model
which completely reflects how live testees respond, but we
are making progress toward that goal. The computer
simulations are necessary because of the rapidity with
which we can study various alternatives. The live testing is
necessary, obviously, because it is people who take tests and
not computers using hypothetical items or hypothetical
subjects. So it is necessary to re-verify the results of the
computer simulations to make sure that they still reflect
what real people do given the variations we have made in
the strategies studied in the simulations.

* Early development work on this research was supported during
1969 and 1970 by grants from the General Research Fund of the
Graduate School, University of Minnesota. Research reported in this
paper was supported since early 1972 by Personnel and Training
Research Programs, Office of Naval Research, Contract No.
N00014-67-A-0113-0029, NR 150-343. Special thanks are due to
John DeWitt, our project programmer, without whom this research
would have been almost impossible.

The second main focus of our research is a concern with
the psychological effects of adaptive testing. Here we are
concerned with identifying the psychological aspects of
testing and the test environment which can introduce error
into test scores. These variables include guessing, test
anxiety, boredom, frustration, and racial or ethnic group
effects.

Guessing can obviously artificially increase test scores;
frustration, anxiety, motivation and other factors can result
in test scores lower than true ability. All of these, therefore,
are sources of error in test scores which are due to the
psychological effects of testing.

We are also concerned with the psychological effects
that will result from the man-machine interface. This, from
our experience, is going to be an important problem in
computerized adaptive testing. There are different kinds of
computer systems on which we can implement adaptive
testing and each of those computer systems has its positive
and negative effects on testee behavior. There are different
kinds of terminal devices for adaptive testing and each kind
of terminal device displays in different ways and at
different speeds. All of these variations in the man-machine
interface are going to be new problems for us to consider in
the years to come. Past research has demonstrated that
answer sheets in paper-and-pencil testing sometimes had an
effect on test scores. Similarly, research in adaptive testing
will need to study different kinds of CRTs, different kinds
of computer systems and different display speeds as part of
the psychological effects of computerized testing.

A third source of error that we are concerned with has
been briefly discussed this morning by Dr. Samejima; this is
error that results from not extracting enough information
from a testee's response to a test item. To date, most
psychometric research has been concerned with binary or
0-1 scoring. But, as Dr. Samejima has indicated, we can get
more information out of a test response if we treat it as a
graded item. Our research extends that reasoning to
continuous responses using the continuous case of latent
trait theory. The continuous case is operationalized by
probabilistic responding.

This aspect of our research is concerned with integrating
probabilistic responding with adaptive testing. Probabilistic
responding, like adaptive testing, can result in horizontal
information functions. This implies that if we put adaptive
testing and probabilistic responding together we will have
extremely powerful methods of reducing errors in test
scores due to the incomplete use of test responses.

The fourth source of error that we are studying is the
error that results from deviations from unidimensionality.
Latent trait theory, as it is usually used in testing, is based
on the assumption of unidimensionality, although there are
multidimensional latent trait models being developed. But
dimensionality that is defined on a group, such as the
unidimensionality of latent trait theory, does not
necessarily hold true for an individual. That is,
dimensionality defined by factor analysis or other methods,
when applied to an individual, assumes that the individual is
the typical or average member of the group on which the
dimensionality was defined. Thus, in the testing situation,
when a set of "unidimensional" items is administered to an
individual, the result may be a set of responses that are not
unidimensionally determined.

Consequently, our research is concerned with
individual-item pool interactions, the interaction of one
individual with a set of "unidimensional" items. We are
studying item response protocols of this nature to
determine if meaningful deviations from unidimensionality
do occur for specific individuals. If they do, we will then
attempt to develop interactive testing models that will take
account of intra-individual multidimensionality in an
adaptive testing situation.

The focus of our research effort, as you can see, is with
the individual. We are concerned with identifying those
sources of error in test scores which result in the over- or
under-estimation of each individual's ability.

Recent Results 

Most of our recent results are concerned with the
psychometric effects of adaptive testing, or the comparison
of branching strategies. Thus far we have reported initial
results from both live testing and computer simulation on a
simple two-stage test (Betz & Weiss, 1973, 1974; Larkin &
Weiss, 1975) and a pyramidal branching strategy (Larkin &
Weiss, 1974, 1975). Below, I will report some results from a
flexilevel test (Betz & Weiss, 1975) and some data on our
stratified adaptive test (Weiss, 1975). Mr. McBride will
present some data using Owen's (1975) Bayesian adaptive
testing strategy.

In general, the findings that we have to date show that
adaptive tests have higher test-retest stabilities, a very
practical and useful criterion, when controlled for number
of items and memory effects. Adaptive tests also tend to
show, in simulation studies, better distributions of ability
estimates. That is, ability estimates better reflect the
distribution of generated ability. And, in general, adaptive
tests give information functions which are less variable
throughout the ability range, in support of Lord's
theoretical findings (see Weiss & Betz, 1973).

Flexilevel ability testing. Figure 1 shows the item
structure for Lord's (1971a,b) flexilevel test. In this testing
strategy there is one item at each of a number of difficulty
levels; item 19 is the most difficult item and item 18 the
least difficult item. Everyone starts the flexilevel test with
an item of median difficulty. Items with odd numbers
increase in difficulty as they deviate from the median, and
items with even numbers decrease in difficulty.

Figure 2 shows the paths taken by three different people
through a ten-stage flexilevel test. Starting with the first
item, a correct response leads to the next more difficult
item which has not yet been administered. An incorrect
response leads to the most difficult of the unadministered
easier items. Figure 2a shows a high ability testee going
through a flexilevel test, Figure 2b is for an average ability
testee, and Figure 2c is for a low ability testee.
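
The flexilevel branching rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the study: the function names and the list-of-booleans response format are hypothetical, and items are numbered as in Figure 1 (item 1 at the median, odd numbers progressively harder, even numbers progressively easier).

```python
def structure_size(test_length):
    """Number of items in a flexilevel structure in which each
    testee answers `test_length` items (e.g. 10 -> 19, 40 -> 79)."""
    return 2 * test_length - 1


def flexilevel_path(responses):
    """Return the item numbers administered for a sequence of
    responses (True = correct, False = incorrect)."""
    next_harder = 3  # least difficult unadministered item above the median
    next_easier = 2  # most difficult unadministered item below the median
    item = 1         # everyone starts at the median item
    path = []
    for correct in responses:
        path.append(item)
        if correct:
            # Correct: go to the next more difficult unused item.
            item, next_harder = next_harder, next_harder + 2
        else:
            # Incorrect: go to the most difficult unused easier item.
            item, next_easier = next_easier, next_easier + 2
    return path


if __name__ == "__main__":
    print(flexilevel_path([True] * 10))                 # high ability testee
    print(flexilevel_path([True, False] * 5))           # mixed responses
    print(flexilevel_path([False] * 10))                # low ability testee
```

Under this numbering a testee who answers every item correctly ends on item 19, and one who answers every item incorrectly ends on item 18, which matches the extremes of the structure described above.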

Our live-testing study of flexilevel testing (Betz & Weiss,
1975) used a flexilevel test in which each testee would
answer 40 items, requiring a 79-item structure. That test
and a conventional peaked paper-and-pencil type test,
administered on a computer to control for novelty effects,
were administered to 130 individuals. The same tests were
then used in a computer simulation study. That study used
10,000 "subjects" sampled from a normal distribution of
ability, and an additional 1600 subjects, 100 at each of 16
levels of ability. From these simulation data we calculated
information functions, and test-retest or parallel forms
reliability. From the live-testing study we calculated
test-retest reliabilities, and other data describing score
distributions.

The major result from the live-testing study was that
flexilevel test scores were no more stable on retest than
scores on the conventional test; test-retest stabilities for the
two were virtually identical. The major result from the
simulation study is shown in Figure 3, which displays
information functions for the conventional and flexilevel
tests. Figure 3 shows two findings which were not predicted
by test theory.

First, test theory (e.g., Lord, 1971c) predicts that the
conventional test will always result in higher levels of
information, i.e., better measurement, than any adaptive
test at the median of the ability distribution. Figure 3
shows that the flexilevel test had higher levels of the
information function at the median (θ = 0) of the ability
distribution. The second prediction from test theory (Lord,
1971b) was that the flexilevel test should yield a relatively
horizontal information function. Figure 3 shows an
information function for the flexilevel test which is quite
divergent from horizontal. In fact, the standard deviations
of the information functions show that the flexilevel test
had a larger standard deviation than did the conventional
test; that means that the flexilevel test tended to be less
equi-precise than the conventional test at different levels of
the ability distribution.

A comparison of the results from the computer
simulation study and the live-testing study showed
differences in the test-retest reliabilities. This result was
expected because of the memory effects in live testing.
There were also differences between the two studies in the
shapes of the generated score distributions. These
differences demonstrated that the simulation model was
not yet adequate to reflect the results of live testing
and that it needs some revision so that it will enable us to
extrapolate from live testing through computer simulation
and back to live testing.

Another interesting result from this simulation study
relates to the methodology of computer simulation itself.
The design of the study was one in which we repeated the
computations for a hundred samples of a hundred subjects
each in order to study the sampling distribution of the
simulation results. This was done to examine the generality
of findings from computer simulation studies which use
100 or fewer simulated subjects (e.g., Jensema, 1974; Urry,
1971). We found that estimates of validity, the correlation
of generated ability with estimated ability, based on
samples of 100, ranged from .87 to .95, with a mean of .91.
In certain inter-strategy comparisons different conclusions
about the relative utility of a testing strategy might be
drawn based on validities of .87 or .95. Thus, simulation
studies should be based on samples of more than 100 in
order to arrive at stable conclusions.
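
The sampling variation described here is easy to reproduce in outline. The sketch below is not the simulation model used in the study: it simply assumes that an ability estimate equals generated ability plus independent normal error, with the error variance chosen so that the population validity is .91, and then draws 100 samples of 100 simulated subjects. The function names, the seed, and the error model are all assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def validity_samples(n_samples=100, n_subjects=100, rho=0.91, seed=1975):
    """Validity coefficients from repeated small simulation samples.

    Assumed model: estimate = ability + error, with error variance
    1/rho**2 - 1 so that corr(ability, estimate) = rho in the population.
    """
    rng = random.Random(seed)
    error_sd = math.sqrt(1.0 / rho ** 2 - 1.0)
    results = []
    for _ in range(n_samples):
        ability = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        estimate = [t + rng.gauss(0.0, error_sd) for t in ability]
        results.append(pearson_r(ability, estimate))
    return results


if __name__ == "__main__":
    rs = validity_samples()
    print(f"min {min(rs):.2f}  mean {sum(rs) / len(rs):.2f}  max {max(rs):.2f}")
```

Even under this simplified error model, validities computed from samples of 100 spread over a range of several hundredths around the population value, which is the point made in the text about drawing conclusions from small simulation samples.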

Two-stage testing. Figure 4 shows a computer report
from what we have called a continuous second-stage
two-stage test. This adaptive testing procedure was
developed by Brad Sympson of our research staff; we later
discovered that Fred Lord had independently developed the
same testing procedure. In Fall 1975 we tested a number of
college students on this continuous second-stage test.

The major problem with two-stage tests as they have
been used in the past (Weiss, 1974) is that of routing errors
made in branching from the routing test to the
measurement test because of errors of measurement in the
routing test. To solve this problem, we developed a
measurement test stage which consists of a number of very
short measurement tests. The example shown in Figure 4
used a 14-item routing test and 25 4-item measurement
tests, each at a different level of difficulty. Using this
adaptive testing procedure, when an individual completes
the routing test his score is determined and that score is
used to choose an appropriate measurement test. Then, to
reduce routing errors, a number of measurement tests on
either side of the chosen measurement test are also
administered to the individual. In the example shown in
Figure 4, the individual's score on the routing test
estimated his ability at 1.4 standard deviations above the
mean. Consequently, the most appropriate measurement
test was estimated to be number 18, which had items at
difficulty about 1.4 standard deviations above the mean.
But, to compensate for possible errors of measurement in
the routing test, he was also administered items in
measurement tests 14 through 17 and 19 through 22, for a
total of 36 measurement test items. These items varied in
difficulty from about .25 to 2.25 S.D.'s on the difficulty
continuum.
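
The selection of second-stage tests in this example can be written out as a short Python sketch. The function name, the `spread` parameter, and the clamping at the ends of the 25-test range are assumptions made here for illustration; the text does not say how testees routed near the extremes were handled.

```python
ROUTING_ITEMS = 14          # items in the routing test (from the example)
ITEMS_PER_TEST = 4          # items in each short measurement test
N_MEASUREMENT_TESTS = 25    # measurement tests, ordered by difficulty


def second_stage_tests(chosen, spread=4, n_tests=N_MEASUREMENT_TESTS):
    """Measurement tests administered around the chosen one.

    `spread` tests on either side of `chosen` are added; the range is
    clamped to 1..n_tests (an assumption, not stated in the source).
    """
    low = max(1, chosen - spread)
    high = min(n_tests, chosen + spread)
    return list(range(low, high + 1))


tests = second_stage_tests(18)
measurement_items = len(tests) * ITEMS_PER_TEST
total_items = ROUTING_ITEMS + measurement_items

if __name__ == "__main__":
    print(tests, measurement_items, total_items)
```

For the testee in Figure 4 this gives measurement tests 14 through 22, 36 measurement items, and 50 items in all once the routing test is included.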

Following a design that we have used in a number of
other studies, we did a test-retest live-testing study with this
continuous second-stage two-stage test (in which each
testee completed 50 items) and a 50-item conventional
peaked test, over about a five-week period, with 104
testees. To keep scoring method the same for both testing
strategies, maximum likelihood scoring was used for both
the two-stage and conventional test.

The study was designed to equate the testing
procedures for 1) item discriminations, 2) memory effects,
and 3) number of items. Memory effects were equated by
first determining the number of items each individual
repeated on retest of the two-stage test. Then the retest of
the conventional test was structured to have the same
number of repeated items by including the appropriate
number of new items.

The test-retest correlation was .94 for the continuous
two-stage test and .66 for the equivalent conventional test.
Since the difference in stabilities was considerably larger
than found in our previous studies of conventional and
adaptive testing strategies (e.g., Betz & Weiss, 1973, 1975;
Larkin & Weiss, 1974), we carefully examined the
distribution of conventional test scores derived from the
maximum likelihood scoring. Six testees were found with
very low ability scores, apparently due to guessing on the
conventional test. Data for these testees were eliminated
and the test-retest correlations were recalculated. The
stability correlation for the two-stage test was .93 and the
conventional test .89. This result was similar to that
obtained in other comparisons of conventional and adaptive
strategies, showing a higher test-retest correlation for the
adaptive test than for the peaked conventional test. This
result was obtained when both testing strategies were
equated for item discriminations and memory effects.

Stradaptive ability testing. The stradaptive testing
strategy (Weiss, 1973) is based on a series of peaked tests,
each one differing in terms of difficulty. Figure 5 shows the
distribution of item difficulties for a hypothetical
stradaptive test. In Figure 5 there are nine strata, each of
which is a peaked test peaked at a different level of
difficulty.

Figure 6 shows an example of an individual moving
through a stradaptive test. Testing begins with an item at
some point on the difficulty continuum; the entry point is
estimated by prior information about the testee. The
individual shown in Figure 6 began with the first item at
stratum 5, an item of average difficulty. Since he answered
that item correctly, he was administered the first item at
stratum 6, which consisted of slightly more difficult items.
Following the same branching rule (a more difficult item is
administered following a correct response, and a less
difficult item following an incorrect response) the
stradaptive test continues until the termination criterion is
reached. The test is terminated when a stratum is identified
at which the individual is responding at or below chance
level (i.e., 20% or less correct) based on a minimum of five
items administered at that stratum. The individual shown in
Figure 6 answered five items at stratum 8 and none of them
were answered correctly. Consequently the test was
terminated since further testing was likely to provide little
additional information on the testee's ability level.
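
The branching and termination rules for the stradaptive test can be sketched as follows. This is a simplified illustration, not the project's testing program: the function name, the `answer` callback, the clamping at strata 1 and 9, and the `max_items` safety limit are assumptions introduced here, and the sketch ignores item-pool exhaustion within a stratum.

```python
def run_stradaptive(answer, entry_stratum=5, n_strata=9,
                    min_items=5, chance_level=0.20, max_items=60):
    """Administer items until some stratum shows chance-level responding.

    `answer(stratum, item_index)` returns True for a correct response.
    Returns (record, ceiling): the list of (stratum, correct) pairs and
    the stratum that triggered termination (None if max_items was hit).
    """
    administered = [0] * (n_strata + 1)  # index 0 unused
    n_correct = [0] * (n_strata + 1)
    record = []
    stratum = entry_stratum
    while len(record) < max_items:
        is_correct = bool(answer(stratum, administered[stratum]))
        administered[stratum] += 1
        n_correct[stratum] += int(is_correct)
        record.append((stratum, is_correct))
        # Terminate at a stratum with at least `min_items` items and
        # a proportion correct at or below chance.
        proportion = n_correct[stratum] / administered[stratum]
        if administered[stratum] >= min_items and proportion <= chance_level:
            return record, stratum
        # Branch up after a correct response, down after an incorrect one.
        stratum += 1 if is_correct else -1
        stratum = min(n_strata, max(1, stratum))
    return record, None


def knows_up_to_stratum_7(stratum, item_index):
    """Hypothetical testee: always right through stratum 7, wrong above."""
    return stratum <= 7


record, ceiling = run_stradaptive(knows_up_to_stratum_7)

if __name__ == "__main__":
    print([s for s, _ in record], ceiling)
```

For this hypothetical testee the record climbs from stratum 5 to stratum 8 and then alternates between strata 7 and 8 until five items at stratum 8 have been missed, at which point testing stops, much like the individual described for Figure 6.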

Scoring of the stradaptive test results in both ability
level scores and consistency scores. Ability level scores
reflect the individual's position on the ability scale;
consistency scores reflect the variation in item difficulties
encountered as the individual goes through the stradaptive
test. Figure 7 shows the stradaptive test response record for
an inconsistent individual. This person started the test with
a relatively difficult item at stratum 8 but answered some
easy items incorrectly (e.g., items S and 26) and some
difficult items correctly (e.g., items 1 and 17). The result
was a response record which varied widely across six strata.
A comparison of the consistency scores for Figure 7 with
those of Figure 6 shows the former to be uniformly higher.
Thus, the testee depicted in Figure 7 was more inconsistent
in his interaction with this item pool than was the
individual in Figure 6.

Our live-testing test-retest study of the stradaptive test
was based on about 200 subjects. Over an average five-week
period the test-retest reliability for the best method of
scoring the stradaptive test was .90; the test-retest
reliability for a conventional test using the number of items
administered on the average in the stradaptive test (23
items) was .86. This result showed about the same
difference in favor of the adaptive test as we have obtained
with other adaptive testing strategies.

I had hypothesized earlier (Weiss, 1973) that consistency
scores should reflect something about the dimensionality
that results from an individual's interaction with an item
pool. To extend this hypothesis, if an individual is
responding unidimensionally his scores should be more
reliable than an individual whose interaction with an item
pool is multi-dimensional. In operationalizing this
hypothesis, consistency scores were used as an indicator of
dimensionality, and test-retest stability as an estimate of
reliability. Specifically, testees were divided into five
sub-groups on the basis of their time 1 consistency scores,
and test-retest reliabilities were computed separately for
each of the five sub-groups. The results are shown in Table
1 for consistency score 11, the standard deviation of items
encountered.
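
A consistency score of this kind, the standard deviation of the difficulties of the items a testee encountered, can be illustrated with a short Python sketch. The difficulty values below are invented for illustration, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; the text does not say which form of the standard deviation was used.

```python
import statistics


def consistency_score(difficulties):
    """Standard deviation of the item difficulties encountered.

    Lower values indicate a more consistent response record.
    """
    return statistics.pstdev(difficulties)


# Hypothetical difficulty values (in normal ogive units) for two records.
consistent_record = [0.0, 0.7, 1.4, 2.1, 1.4, 2.1, 1.4, 2.1]
inconsistent_record = [2.1, 1.4, 0.7, 1.4, 2.1, 0.0, -0.7, 0.7, 2.1, -1.4]

if __name__ == "__main__":
    print(round(consistency_score(consistent_record), 3))
    print(round(consistency_score(inconsistent_record), 3))
```

A record that settles quickly into a narrow band of difficulties yields a small value, while a record that wanders across many strata yields a larger one, which is the sense in which higher scores indicate greater inconsistency.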

As Table 1 shows, the highest test-retest stabilities were
observed for the very high consistency group for all ten
methods of estimating ability within the stradaptive test.
The clearest pattern emerged for ability score 1. On that
score, the stability for the highly consistent testees was ^^24,
and that for the very low consistency group was .65, with
stabilities for the intermediate groups decreasing with
decreasing consistency. The possible utility of consistency
scores as a moderator variable is that it might permit us to
make more stable predictions for some groups of
individuals (consistent testees) than for others (inconsistent
testees). Particularly noteworthy is the test-retest reliability
of .98 for the very highly consistent testees on ability
scores 8 and 9.

If these results can be replicated over longer periods of
time, the consistency score might prove to be a very useful
and powerful moderator variable derivable from a
stradaptive testing response record. It appears to be powerful
because it also moderates the test-retest reliability, but not
as systematically, on the conventional test administered at
the same time. Table 1 shows a test-retest reliability of .979
on the conventional test for the highly consistent group
using the consistency scores derived from the stradaptive
test. But consistency scores are not derivable from a
conventional test so it is necessary to implement this
finding within the framework of the stradaptive testing
strategy.

Figure 8 shows a number of "subject characteristic
curves," which are derivable from the stradaptive test.
These curves, which reflect the individual's consistency of
interaction with a stradaptive test, are based on a plot of
proportion correct for each individual at each stratum of
the stradaptive test. For example, the plot for "William W."
shows that he answered all items correctly at both stratum
5 and stratum 6, about half correct at stratum 7 and none
correct at stratum 8. Since proportion correct decreases
monotonically with increasing item difficulty this
individual appears to be interacting with this item pool
unidimensionally; William W. is a highly consistent individual.
By way of contrast, the subject characteristic curve for "Carol
C." does not decrease monotonically, reflecting an
inconsistent individual who answers items correctly at a
variety of difficulty levels.

To be useful, these subject characteristic curves must be
stable across time. To investigate their stability across an
average five-week retest interval we computed canonical
correlations between proportions correct at initial test and
at retest. The complete redundancy analysis showed that
67% of the variance in retest subject characteristic curves
was predictable from initial testing. This is equivalent to a
multiple correlation of .82 (a squared multiple correlation
of .67) for predicting individual proportion correct at Time
2 from a best-weighted linear combination of proportions
correct at Time 1. These results imply that subject
characteristic curves are reasonably stable and that they may
represent a stable trait of the individual. But, certainly,
more research is needed.

[Figure 8: Proportion correct at each stratum, by individual]

In addition to this live-testing study of the stradaptive
test, we also have some recent data from a computer
simulation study. Items with constant discriminations, and
difficulties rectangularly distributed between normal ogive
difficulty values of -3.33 and 3.33 and grouped into nine
equally wide strata, were used for the stradaptive test. Items
with constant discriminations and with difficulties
rectangularly distributed between -.33 and .33 (equivalent to
the middle stratum of the stradaptive test) were used for the
conventional test. 1000 Ss were generated with abilities in
the given interval at each of 13 intervals of θ. Major
findings are shown in Figure 9 and Table 2.

Figure 9 shows the information functions for the
stradaptive and conventional tests at two different levels of
item discrimination. At both levels of item discrimination,
the information function for the stradaptive test was more
horizontal than that of the conventional test, with the
difference more pronounced at the higher level of item
discrimination. In confirmation of Lord's theoretical
predictions, the conventional test has a higher information
function than the stradaptive test at the center of the
ability distribution, but the range of superiority diminishes
with increasing item discriminations. However, the
information function for the stradaptive test increases with
ability level, and for the lower discriminating items, the
stradaptive test at θ > 2.5 yields a higher information
function than the highest value reached by the conventional
test.

[Figure 9: Information functions for 60-item tests]

Table 2 shows validities, correlations of ability estimate
and generated ability, from the simulation data on
conventional and stradaptive tests. Validity correlations are
shown as a function of both item discriminations and number
of items. These results show a slight superiority in validities
for the conventional tests when item discriminations are low
(a=.5) and there are 40 or fewer items in both tests; a
similar result is found for 10-item tests composed of items
at a=1.0. In all other conditions, the stradaptive test yields
higher validity, with sizable differences appearing as
number of items increases and discriminations increase. For
60-item tests at a=2.0, the validity of the stradaptive test
was r=.989, while the conventional test validity was only
.926.

Thus, the data from both the live-testing study and the
simulation study of stradaptive tests show that the
stradaptive test yields scores which are more equi-precise
across the ability range, and have higher validities and
reliabilities than conventional tests under certain conditions.
Further, the stradaptive test consistency scores appear to be
powerful moderator variables which may have important
practical applications in testing individuals.

Psychological effects of computerized administration.
One of the psychological variables that has been
unsystematically manipulated in computerized testing studies
has been feedback or knowledge of results. In computerized
testing we now have the capability to tell an individual
whether his answer was correct or incorrect after each item
in a test. But it is possible that such immediate knowledge
of results might have an effect on test scores. Thus, we
designed a pilot study to systematically manipulate
feedback and study its effects on test scores.

TABLE 2

Score-Ability Correlations of the Stradaptive Bayesian Score and
the Conventional Test Score for Tests of 10 to 60 Items, as a
Function of Item Discrimination

We administered two tests on the computer to a group
of inner-city high school students. The group was racially
mixed, consisting of both white students and black
students. Both a conventional test and a pyramidal adaptive
test were administered to each student, and half the group
received the conventional test first and half received the
adaptive test first. In addition, half the group received
feedback after each item and the other half received no
feedback after each test item. We analyzed the data for the
conventional test only; thus, the dependent variable in this
analysis was number correct on the conventional test. The
design was a 2x2x2 analysis of variance. The independent
variables were 1) race: black and white; 2) feedback:
immediate or none; and 3) order: conventional test
administered first or second in the pair.

In order to make the feedback relevant to the high
school group, we had previously asked a subgroup of
students from the same school to generate a set of
statements which would, to them, indicate that they
answered an item correctly. We used six such statements, in
pseudorandom order, including "right on," "that's cool,
now try this one," and "all right, how about this one." This
was done on the hypothesis that feedback can have an
effect only if it is meaningful or relevant to the testee.

The results for the three-way analysis of variance are
shown in Table 3. The only significant main effect was for
race. The mean score for the blacks was 17.74 and that for the
whites was 27.92, on the 40-item test. Neither order nor
feedback effects were significant, nor were any of the
two-way interactions. The three-way order x race x
feedback interaction was significant at p<.01.

TABLE 3

                        Feedback      No Feedback       Total
Group                   N    Mean      N    Mean      N    Mean

Blacks   First          8   26.38      6   13.83     14   21.00
         Second         7   13.86      6   14.67     13   14.23
Whites   First         15   26.07     14   30.93     29   28.41
         Second        15   30.00     19   25.53     34   27.50
Blacks                 15   20.53     12   14.25     27   17.74
Whites                 30   28.03     33   27.82     63   27.92
First                  23   26.17     20   25.80     43   26.00
Second                 22   24.86     25   22.92     47   23.83
Total                  45   25.53     45   24.20     90   24.87

3-Way ANOVA

Source of Variation         DF    Mean Square       F        p

Order                        1        105.76       1.36     .25
Race                         1      2,013.26      25.84    <.001
Feedback                     1         81.74       1.05     .31
Race x Order                 1        161.54       2.07     .15
Order x Feedback             1         28.74        .37     .55
Race x Feedback              1        170.40       2.19     .14
Order x Race x Feedback      1        599.46       7.69    <.01
Error                       82         77.92
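
The F ratios in the analysis of variance can be checked directly from the mean squares reported in Table 3, since each F is the effect mean square divided by the error mean square (82 degrees of freedom). The short Python sketch below does only that arithmetic; the dictionary layout and function name are illustrative choices, and no p values are computed.

```python
MS_ERROR = 77.92  # error mean square from Table 3 (82 df)

# Effect mean squares and the F ratios as reported in Table 3.
mean_squares = {
    "Order": 105.76,
    "Race": 2013.26,
    "Feedback": 81.74,
    "Race x Order": 161.54,
    "Order x Feedback": 28.74,
    "Race x Feedback": 170.40,
    "Order x Race x Feedback": 599.46,
}
reported_f = {
    "Order": 1.36,
    "Race": 25.84,
    "Feedback": 1.05,
    "Race x Order": 2.07,
    "Order x Feedback": 0.37,
    "Race x Feedback": 2.19,
    "Order x Race x Feedback": 7.69,
}


def f_ratio(ms_effect, ms_error=MS_ERROR):
    """F ratio for a single-df effect: effect MS over error MS."""
    return ms_effect / ms_error


if __name__ == "__main__":
    for source, ms in mean_squares.items():
        print(f"{source:<24} F = {f_ratio(ms):.2f}")
```

Each recomputed ratio rounds to the tabled F, with race and the three-way interaction standing well apart from the other effects.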

Figure 10 shows the means for the three-way
interaction. As is indicated in Figure 10, under conditions of
immediate feedback, when a conventional test was
administered first, the mean of the black students (26.38) was not
significantly different from the mean of the white students
(26.07) who completed the conventional test under the same
set of conditions. This result implies, if it can be replicated,
that race differences observed in test scores may be a
function not of differences in ability but of differences in
the psychological effects of the conditions of
administration. Although these findings do not completely replicate
those of Johnson & Mihal (1973), they do support their
general conclusion that conditions of test administration
might affect motivational conditions, which in turn reduce
race group differences to nonsignificant levels.

There are some data in our results which suggest that the
three-way interaction results might be due to motivational
effects. In addition to analyzing test scores, we also
analyzed the proportion of items skipped on the
conventional test under the two experimental conditions and for
the two racial groups. These results showed that blacks
skipped more items than whites, in general, but when the
conventional test was administered first to the black
students and they received feedback, they skipped almost
no items. This is also the same set of conditions under
which the test scores for the blacks were not significantly
different from those of the whites. This appears to be a
motivational effect since when the blacks are given
feedback the test becomes relevant to them; and when it
becomes relevant they can answer the questions just as well
as the whites.

Future Plans 

Based on these preliminary findings we plan to continue
to investigate the nature of feedback effects, and the effects
of other psychological variables, on test scores. We also plan
to continue to study various branching schemes in an
attempt to develop optimal branching schemes which result
in maximum reduction in psychometric error at all ability
levels. Our general goal, as I indicated earlier, is to explore
all aspects of computerized ability testing in an effort to
make maximal use of the computer as a vehicle for making
each individual's test score as error-free as possible.

ADAPTIVE TESTING RESEARCH AT MINNESOTA -
SOME PROPERTIES OF A BAYESIAN SEQUENTIAL ADAPTIVE
MENTAL TESTING STRATEGY¹



JAMES R. McBRIDE
University of Minnesota


Adaptive or tailored testing subsumes a number of
different strategies for adapting the difficulty of test items
to the ability of the examinee. One of the most elegant of
such strategies is a Bayesian sequential technique proposed
by Owen (1969) and studied empirically by several
investigators including Wood (1969), Urry (1971) and Jensema
(1972).

Owen's technique is a general one for the sequential
design and the analysis of independent experiments with a
dichotomous response. Its application in mental testing is
to the problem of estimating ability by means of sequential
selection, administration and scoring of dichotomous test
items. The mathematical details of the method arise out of
latent trait theory, with the item characteristic curves all
assumed to take the form of the normal ogive. The
properties of the normal ogive item characteristic function,
and its logistic approximation, have been described by Lord
& Novick (1968) and Birnbaum (1968), respectively.

Owen's procedure involves the individually tailored
sequential design of a test by appropriate choice of
available item parameters² ($a_g$, $b_g$, $c_g$) and estimation of
ability via a Bayesian-motivated approximation. At each
step $m$ in the ability estimation sequence, a normal prior
distribution on ability ($\theta$) is assumed, with parameters
($\mu_m$, $\sigma^2_m$), where $m$ indicates the number of items already
administered in the sequence. A test item to be administered
at step $m+1$ is selected so as to minimize a quadratic
loss function on $\theta$. With $c_g = 0$ (i.e., no guessing) and
discrimination parameters constant over items, the
appropriate item is the available one which minimizes the
absolute value of the difference ($b_g - \mu_m$). With $c_g > 0$ the
optimal difference is somewhat negative, that is, optimal
difficulty is somewhat "easier" than the examinee's ability.
Following item administration at step $m+1$, the parameters
$\mu_m$, $\sigma^2_m$ of the prior distribution are updated in accord



¹ Research reported herein was supported by the Personnel and
Training Research Programs, Psychological Sciences Division, Office
of Naval Research, under contract No. N00014-67-A-0113-0029, NR
No. 150-343.

Portions of these results were presented at the Spring meeting of
the Psychometric Society in Iowa City, Iowa, April 1975.

A complete report of these results is in preparation (McBride &
Weiss, 1975a).

² As most commonly used, $a_g$ and $b_g$ respectively are the
discrimination and difficulty parameters of the normal ogive model.
$c_g$ is the guessing parameter, the probability that an examinee will
respond correctly to the item when he does not know the answer.
The subscript $g$ indexes items.



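with the examinee's performance on the item. In the case
of a correct answer,

$$\mu_{m+1} = E(\theta \mid 1) = \mu_m + (1 - c_g)\,\frac{\sigma_m^2}{\sqrt{1/a_g^2 + \sigma_m^2}}\,\frac{\phi(D)}{A}$$

and

$$\sigma_{m+1}^2 = \mathrm{Var}(\theta \mid 1) = \sigma_m^2\left\{1 - \frac{(1 - c_g)\,\phi(D)}{\left(1 + \dfrac{1}{a_g^2\sigma_m^2}\right)A}\left[\frac{(1 - c_g)\,\phi(D)}{A} - D\right]\right\}. \qquad (1)$$

Following a wrong answer,

$$\mu_{m+1} = E(\theta \mid 0) = \mu_m - \frac{\sigma_m^2}{\sqrt{1/a_g^2 + \sigma_m^2}}\,\frac{\phi(D)}{\Phi(D)}$$

and

$$\sigma_{m+1}^2 = \mathrm{Var}(\theta \mid 0) = \sigma_m^2\left\{1 - \frac{\phi(D)}{\left(1 + \dfrac{1}{a_g^2\sigma_m^2}\right)\Phi(D)}\left[\frac{\phi(D)}{\Phi(D)} + D\right]\right\}. \qquad (2)$$

In the above equations (taken from Owen, 1975),
$\phi(D)$ is the normal probability density function,
$\Phi(D)$ is the cumulative normal distribution function, and

$$D = \frac{b_g - \mu_m}{\sqrt{1/a_g^2 + \sigma_m^2}}, \qquad A = c_g + (1 - c_g)\,\Phi(-D). \qquad (3)$$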
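
The updating step can be sketched in a few lines of Python. This is an illustrative sketch only, not the program used in these studies: it assumes the standard form of Owen's normal ogive update (posterior mean and variance after a correct or an incorrect response), and the names `owen_update`, `mu` and `var` are hypothetical.

```python
import math


def phi(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def owen_update(mu, var, a, b, c, correct):
    """Update a N(mu, var) prior on ability after one item response.

    a, b, c are the item's discrimination, difficulty and guessing
    parameters; `correct` is True for a right answer.
    Returns (posterior mean, posterior variance).
    """
    s = math.sqrt(1.0 / (a * a) + var)
    D = (b - mu) / s
    shrink = 1.0 + 1.0 / (a * a * var)
    if correct:
        A = c + (1.0 - c) * Phi(-D)        # marginal P(correct)
        g = (1.0 - c) * phi(D) / A
        new_mu = mu + (var / s) * g
        new_var = var * (1.0 - (g / shrink) * (g - D))
    else:
        h = phi(D) / Phi(D)
        new_mu = mu - (var / s) * h
        new_var = var * (1.0 - (h / shrink) * (h + D))
    return new_mu, new_var
```

With no guessing and an item whose difficulty equals the prior mean, a right and a wrong answer move the estimate by equal and opposite amounts and leave the same posterior variance; a nonzero guessing parameter makes a right answer less informative, so the estimate moves up by less.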
$\mu_{m+1}$ and $\sigma^2_{m+1}$, the parameters of the Bayes posterior
distribution on $\theta$, are used as the parameters of the next
step's prior. At each step the prior distribution is taken to
be normal, an assumption which is not strictly correct after
the first item (Owen, 1975). Testing may be terminated
when $\sigma^2_m$ becomes arbitrarily small or when $m$ becomes
arbitrarily large, or when some other criterion has been
reached. At termination the latest $\mu_m$ is the estimator of $\theta$,
and $\sigma^2_m$ is a measure of the uncertainty of the estimate.
Urry (1971) and Jensema (1972, 1974) have interpreted
$\sigma^2_m$ as the squared standard error of estimate (S.E.E.) of $\hat\theta$.
Owen (1975) gives a theorem showing that as $m \to \infty$, $\mu_m \to \theta$.

Practically speaking, of course, the number of items
administered will never approach infinity, but if the pool of
available items is sufficiently large and appropriately
constituted, $\sigma^2_m$ will diminish rapidly, permitting valid
estimation of $\theta$ in a very small number of items. Urry
(1971, 1974) has specified the requirements for a satisfactory
item pool for implementing Owen's testing procedure
and has shown in computer simulation studies that Owen's
sequential test can achieve in no more than 30 items the
validity of a much longer conventional test, with the
average number of items diminishing as their discriminatory
power increased.

Validity, i.e., the correlation of test scores with the
simulated underlying ability, is only one criterion by which
to evaluate a proposed adaptive testing strategy. Since the
Bayesian sequential test scores are actually estimates, in the
same metric, of underlying trait level, the accuracy of the
estimates is also an interesting datum. By "accuracy" here
is meant the closeness of the estimates to actual ability,
which may vary systematically with ability level. Another
interesting property of estimates is bias, or error of central
tendency. Two kinds of bias should be of some concern:
1) unconditional bias, or group mean error of
estimate, and 2) conditional bias, or mean error of estimate
at a given level of the parameter being estimated. As a
matter of convention, then, in the following the term
"accuracy" will refer to mean absolute error of estimate,
$(1/N)\sum_i |\hat\theta_i - \theta_i|$; "bias" will refer to mean algebraic error of
estimate, $(1/N)\sum_i (\hat\theta_i - \theta_i)$; and "conditional bias" will refer
to mean algebraic error of estimate at a given value of $\theta$.
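
The three conventions just defined translate directly into code. The sketch below is illustrative; the function names and the six data points are hypothetical and are not results from the studies.

```python
def accuracy(estimates, abilities):
    """Mean absolute error of estimate, (1/N) * sum |est - true|."""
    return sum(abs(e - t) for e, t in zip(estimates, abilities)) / len(abilities)


def bias(estimates, abilities):
    """Mean algebraic error of estimate, (1/N) * sum (est - true)."""
    return sum(e - t for e, t in zip(estimates, abilities)) / len(abilities)


def conditional_bias(estimates, abilities, level, tol=1e-9):
    """Mean algebraic error among examinees whose true ability equals `level`."""
    errors = [e - t for e, t in zip(estimates, abilities) if abs(t - level) <= tol]
    return sum(errors) / len(errors)


# Hypothetical data: two simulated examinees at each of three ability levels.
abilities = [-2.0, -2.0, 0.0, 0.0, 2.0, 2.0]
estimates = [-1.5, -1.7, 0.1, -0.1, 1.8, 1.6]

print(bias(estimates, abilities))                    # small overall bias
print(conditional_bias(estimates, abilities, -2.0))  # overestimation at the low level
print(conditional_bias(estimates, abilities, 2.0))   # underestimation at the high level
```

In this made-up example the unconditional bias is close to zero even though the conditional biases at the extremes are sizeable and of opposite sign, which is why the two kinds of bias are reported separately.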

The purpose of the present paper is to report the results
of a series of simulation studies designed to investigate the
influence of item pool characteristics on some properties of
the Bayesian sequential test other than the correlational
validity of the trait estimates. These properties will include
bias and accuracy of the estimates, as well as others
enumerated below.

The studies reported below were motivated by results
obtained with live testing of Owen's strategy. Using a
329-item pool of vocabulary knowledge test items, a
correlation of .80 was obtained between estimated ability
and number of test items to termination (McBride & Weiss,
1975b). Simulation studies designed to investigate the
influence of the item pool on that unexpectedly high
correlation led to our discovery of systematic nonlinear
bias in the Bayesian estimates of ability. The nature of the
bias, and some of its correlates, are discussed below.



METHOD

1. Dependent variables of interest included test length
(number of test items administered before the termination
criterion was reached), errors of estimate ($\hat\theta - \theta$), bias of
estimate (mean over individuals of ($\hat\theta - \theta$)), absolute value of
the error $|\hat\theta - \theta|$, and validity of the estimates of $\theta$, $r_{\theta\hat\theta}$.

2. Independent variables of interest included the effects
of guessing in both the response model and the scoring
algorithm, of item discrimination, and the correlation of
difficulty and discrimination parameters in the item pool,
and of different termination criteria.

3. Examinees for the first study were simulated by
computer generation of pseudorandom numbers (from a
normal population with mean 0 and variance 1) which
represented the ability $\theta_i$ of each examinee, $i$. For the
second study, 100 examinees were simulated at each of 31
points on the ability continuum.

4. Item responses were simulated by comparing $P_g(\theta_i)$
for each item $g$ and examinee $i$ with a random number $e_{ig}$
from a rectangular distribution in the interval [0,1]. A
score of 1 for examinee $i$ on item $g$ was assigned if
$P_g(\theta_i) \ge e_{ig}$. Otherwise a score of 0 was assigned.
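
The response-generation rule in step 4 can be sketched as follows. This is a minimal illustration that assumes the three-parameter normal ogive response function; the function names and the seed are hypothetical.

```python
import math
import random


def normal_ogive(theta, a, b, c):
    """P(correct) under the normal ogive model with guessing parameter c."""
    cumulative = 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))
    return c + (1.0 - c) * cumulative


def simulate_response(theta, a, b, c, rng):
    """Score 1 if P_g(theta) >= e, where e is rectangular on [0, 1); else 0."""
    e = rng.random()
    return 1 if normal_ogive(theta, a, b, c) >= e else 0


# An examinee whose ability equals the item difficulty, with .20 guessing,
# answers correctly with probability .20 + .80 * .50 = .60.
rng = random.Random(1975)
responses = [simulate_response(0.0, 1.0, 0.0, 0.20, rng) for _ in range(10)]
print(responses)
```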

5. Item pools were simulated under two different
conditions:

a. A perfect item pool with items of constant
discrimination and guessing parameter was simulated.
Under this condition, the computer program computed the
optimal difficulty $b_{m+1}$ of the next item to administer, and
a simulated item with that difficulty value was made
available. This is referred to as a "perfect" item pool
because in effect we have simulated an item pool in which
an unlimited number of items is available at any point on
the difficulty continuum. The estimated optimal difficulty
of an item to administer at stage $m+1$ is equal to the
current ability estimate, $\hat\theta_m$, when guessing is not a factor
(i.e., when $c_g = 0$). When guessing is a factor ($c_g > 0$), the
estimated optimal difficulty is smaller than $\hat\theta_m$ by an
amount which is a joint function of $a_g$ and $c_g$. That is,
$(b_{m+1} - \hat\theta_m) < 0$ when $c_g > 0$. (Actually, the true optimal
difficulty is a function of $a_g$, $c_g$ and the unknown parameter $\theta$.
The Bayesian sequential test procedure only estimates $\theta$
and hence estimates the optimal item difficulty. At any
rate, the simulated "perfect" item pool makes available at
every step $m$ an item whose difficulty is exactly equal to
the estimated optimal item difficulty based on $a_g$, $c_g$, and
the then current estimate of $\theta$.)

b. A differentially discriminating "perfect" item pool
was simulated by having unlimited item difficulties
available (as in a. above), but varying item discrimination
systematically so that the mean would be specified and
the regression of $a_g$ on item difficulty could be varied. In
this way it was possible to simulate item pools in which
more highly discriminating items were available in some
regions of the ability continuum than in others. The details
of this procedure are described in Study 2, below.

6. The Bayesian sequential test was simulated by a
computer program. Input variables were $\theta_i$, the parameters
$\mu_0$ and $\sigma^2_0$ of the initial prior distribution on $\theta$, the
number of items to be administered to any examinee, the
constant discrimination parameter of the perfect item
pool (or the mean discrimination parameter of the
differentially discriminating perfect item pool), along with
two guessing specifications. The first, $c_i$, specified the
propensity of the examinees to guess while the second, $c_g$,
specified whether guessing was to be accounted for in
scoring.
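
The simulation program described in step 6 might be organized along the following lines. This is a hedged sketch, not the original program: it assumes the standard form of Owen's normal ogive update, an initial N(0, 1) prior (the source does not state the values of the prior parameters), and a perfect pool that always supplies an item with difficulty equal to the current estimate. That choice of difficulty is optimal only when the scoring model assumes no guessing; with a guessing correction the original program chose a somewhat easier item. All function and argument names are hypothetical.

```python
import math
import random


def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def owen_update(mu, var, a, b, c, correct):
    """Posterior mean and variance after one response (assumed Owen form)."""
    s = math.sqrt(1.0 / (a * a) + var)
    D = (b - mu) / s
    shrink = 1.0 + 1.0 / (a * a * var)
    if correct:
        g = (1.0 - c) * phi(D) / (c + (1.0 - c) * Phi(-D))
        return mu + (var / s) * g, var * (1.0 - (g / shrink) * (g - D))
    h = phi(D) / Phi(D)
    return mu - (var / s) * h, var * (1.0 - (h / shrink) * (h + D))


def bayesian_sequential_test(theta, a, c_i, c_g, rng,
                             mu0=0.0, var0=1.0, criterion=0.0625, max_items=30):
    """Simulate one examinee; returns (estimate, posterior variance, test length).

    c_i is the examinee's true guessing propensity (response model);
    c_g is the guessing parameter assumed by the scoring algorithm.
    """
    mu, var, k = mu0, var0, 0
    while var >= criterion and k < max_items:
        b = mu  # perfect pool: difficulty matched to the current estimate
        p_correct = c_i + (1.0 - c_i) * Phi(a * (theta - b))
        correct = p_correct >= rng.random()
        mu, var = owen_update(mu, var, a, b, c_g, correct)
        k += 1
    return mu, var, k


rng = random.Random(1975)
print(bayesian_sequential_test(0.5, 1.5, 0.0, 0.0, rng))    # no guessing
print(bayesian_sequential_test(0.5, 1.5, 0.20, 0.0, rng))   # uncorrected guessing
```

Passing different values for `c_i` and `c_g` reproduces the distinction between the response model and the scoring model that the two guessing specifications are meant to capture.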

Study 1: The effects of guessing

For this study the "perfect" item pool was used, with
two values of $c_g$ ($c_g$ = 0, .20) paired with two values of the
personal guessing tendency $c_i$ ($c_i$ = 0, .20). Of the four possible
pairwise combinations, only three were used, resulting in
three sets of conditions:

                          c_i     c_g
no guessing               0       0
uncorrected guessing      .20     0
corrected guessing        .20     .20

In the first condition, no guessing takes place ($c_i = 0$) and no
correction for guessing enters into the scoring formula
($c_g = 0$). In the second condition $c_i = .20$ (every individual $i$
has a random chance of correct response equal to .20) but
$c_g = 0$ (guessing goes uncorrected in the scoring algorithm).
Finally, in the third condition, the .20 guessing parameter
and the scoring correction for guessing take the same value.
In each condition, the same 100 "examinees" ($\theta_i$
sampled from a normal (0,1) population) were administered
14 simulated Bayesian sequential tests in which testing
terminated for an examinee whenever $\sigma^2_m$, the
estimated variance of the posterior distribution of $\theta$, fell
below .0625 (this is equivalent to the Urry/Jensema
criterion of S.E.E. < .25). The 14 simulated tests in each
condition were experimentally independent, and differed
from each other in the value of the $a_g$ parameter, which was
constant within a test, but which varied systematically
across tests. The 14 $a_g$ values were $a_g$ = .5, .6, .7, .8, .9, 1.0,
1.25, 1.50, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00.

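For each test in each condition, the following variables
were observed:

a. mean and range of test length, $k$

b. errors of estimate, $e_i = (\hat\theta_i - \theta_i)$

c. test bias, $(1/N)\sum_i (\hat\theta_i - \theta_i)$

d. mean absolute error, $(1/N)\sum_i |\hat\theta_i - \theta_i|$

e. test validity, $r_{\theta\hat\theta}$

f. correlated error, $r_{\theta e}$ and $r_{\hat\theta e}$

g. correlated test length, $r_{\theta k}$ and $r_{\hat\theta k}$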
Study 2: The effects of the configuration of item parameters
in the item pool

Most simulation studies of Owen's sequential test have
used a constant item discrimination parameter within each
test. Typical item pools in actual use, however, have varying
item discriminations, with the potential effect of having
more discriminating items available in some ranges of the
trait level than in others. In this study, different item pool
configurations were simulated using the differentially
discriminating perfect item pool. The approximate
correlation ($r_{ab}$) between item discriminating power and
item difficulty was varied in order to observe its effect on
some properties of the Bayesian test and of the resulting
scores.

Three different values of $r_{ab}$ were simulated: -.71, 0
and +.71. With $r_{ab}$ = +.71, more discriminating items are
available, on the average, at higher levels of $\theta$. With $r_{ab}$ = -.71
the more discriminating items were available at the lower
levels of $\theta$. And with $r_{ab}$ = 0, no level of $\theta$ was favored in
terms of available discriminating power of the items,
although discriminating power was free to vary randomly.
In each "item pool" configuration, the mean item
discrimination $\bar{a}_g$ was set at 1.25. Additionally, a minimum $a_g$
value of .80 was imposed, in accord with Urry's (1974)
recommendation.

The item pool configuration was simulated by means of:

1) selecting the appropriate $b_g$ for the next item from
the perfect item pool as though $a_g$ were equal to $\bar{a}_g$; call
this $b^*_g = (b_g \mid \bar{a}_g, \hat\theta_m, c_g)$;

2) calculating a conditional $a_g$ value from a linear
transform of $b^*_g$:

$$a_{g \mid b^*_g} = \bar{a}_g + r_{ab}\,\frac{S.D._a}{S.D._b}\,b^*_g$$

where $S.D._a$ is the standard deviation of the $a_g$
parameters in the simulated pool,

$S.D._b$ is the standard deviation of the $b_g$ parameters
in the simulated pool, and

$\bar{a}_g$, $b^*_g$, $r_{ab}$ are as previously defined;

3) adding an error component, $e_g$, to the approximate
$a_g$, so that for each item administered $a^*_g = a_{g \mid b^*_g} + e_g$,

where $a^*_g$ is the simulated discriminating power of
the item,

$a_{g \mid b^*_g}$ is the approximate discrimination defined
above, and

$e_g$ is a random number from a population normal in
form with standard deviation $\sigma_e = S.D._a\,(1 - r_{ab}^2)^{1/2}$;

4) setting $a^*_g$ equal to .80 whenever it would otherwise
have a lower value.

"Examinees" for this study were 3100 simulated $\theta$'s,
100 at each of 31 equally spaced intervals between -3.0
and 3.0, inclusive. The corrected guessing condition
($c_i = c_g = .20$) was in effect. The posterior variance termination
criterion ($\sigma^2_m < .0625$) was used, with an arbitrary
30-item maximum test length. At each of the 31 $\theta$ levels,
the following variables were observed for each individual $i$:

a. test length, $k_i$

b. test score, $\hat\theta_i$

c. error of estimate, $e_i = \hat\theta_i - \theta_i$

While Study 1 examined average characteristics of the
Bayesian test and test scores, Study 2 was concerned with
certain properties of the procedure as a function of trait
level, and of the item pool configuration, $r_{ab}$. For each
configuration, the regressions of $k$, $e$ and $\hat\theta$ on $\theta$ were
estimated from the means of the 100 individuals at each
level of $\theta$.
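
Steps 2 through 4 of the item pool configuration procedure described above can be sketched in Python. The sketch rests on several assumptions that should be read as such: the linear transform is taken to be the ordinary regression of discrimination on difficulty with mean difficulty zero, the error component is given mean zero, and the standard deviations used in the example (.30 for discrimination, 1.73 for difficulty) are hypothetical values, since the source does not report them.

```python
import math
import random


def simulated_discrimination(b_star, r_ab, mean_a, sd_a, sd_b, rng, min_a=0.80):
    """Return a simulated discrimination for an item of difficulty b_star.

    Step 2: conditional value from a linear transform of b_star.
    Step 3: add a normal error with SD = sd_a * sqrt(1 - r_ab^2).
    Step 4: impose the .80 floor.
    """
    a_conditional = mean_a + r_ab * (sd_a / sd_b) * b_star
    sigma_e = sd_a * math.sqrt(1.0 - r_ab ** 2)
    a_star = a_conditional + rng.gauss(0.0, sigma_e)
    return max(a_star, min_a)


rng = random.Random(31)
for b in (-3.0, 0.0, 3.0):
    # Hypothetical pool: mean discrimination 1.25, r_ab = +.71
    print(b, round(simulated_discrimination(b, 0.71, 1.25, 0.30, 1.73, rng), 2))
```

Because the error standard deviation shrinks as the absolute correlation grows, the total spread of the simulated discriminations stays near the specified standard deviation whatever the value of the correlation, apart from the effect of the .80 floor.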

Additionally, the data were used to calculate empirical
values of the information function $I_{\hat\theta}(\theta)$ of the Bayesian
test scores $\hat\theta$. The information at any level $\theta_i$ may be
calculated as the square of the ratio of the partial derivative
with respect to $\theta$ of the regression of test scores $\hat\theta$ on $\theta$, to
the conditional standard deviation ($\sigma_{\hat\theta \mid \theta}$) of the test scores
at the given level of $\theta$. This may be written

$$I_{\hat\theta}(\theta) = \left[\frac{\dfrac{\partial}{\partial\theta}\,E(\hat\theta \mid \theta)}{\sigma_{\hat\theta \mid \theta}}\right]^2$$

(after Lord, 1970, p. 153). In each
configuration for each of the 31 levels of $\theta$, the conditional
standard deviation was estimated as the observed S.D. of
the 100 test scores at that level. The numerator of the
equation was calculated for each $\theta$ point from a third
degree polynomial equation for the regression of $\hat\theta$ on $\theta$,
estimated by least squares fit to the thirty-one mean $\hat\theta$'s
observed under each item pool configuration.
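
The information calculation can be sketched as follows: fit a cubic to the mean test scores by least squares, differentiate it, and square the ratio of the slope to the conditional standard deviation. The code is an illustrative sketch in plain Python; the function names are hypothetical, and the data in the example are made up rather than taken from the studies.

```python
def fit_cubic(xs, ys):
    """Least-squares cubic fit via the normal equations.

    Returns [b0, b1, b2, b3] for b0 + b1*x + b2*x^2 + b3*x^3.
    """
    n = 4
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    rhs = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for cc in range(col, n):
                A[r][cc] -= f * A[col][cc]
            rhs[r] -= f * rhs[col]
    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = rhs[i] - sum(A[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = s / A[i][i]
    return beta


def information(theta, beta, conditional_sd):
    """Squared ratio of the regression slope at theta to the conditional SD."""
    slope = beta[1] + 2.0 * beta[2] * theta + 3.0 * beta[3] * theta ** 2
    return (slope / conditional_sd) ** 2


# Hypothetical mean test scores at 31 equally spaced ability levels.
levels = [-3.0 + 0.2 * i for i in range(31)]
mean_scores = [0.05 + 0.9 * t - 0.01 * t ** 3 for t in levels]
beta = fit_cubic(levels, mean_scores)
print(information(0.0, beta, 0.25))   # conditional SD of .25 is also hypothetical
```

A flatter regression of test scores on ability, or a larger conditional standard deviation, lowers the information at that ability level.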

RESULTS

Study 1

Tables 1, 2 and 3 and Figures 1, 2 and 3 contain the
results of sequential testing under the three conditions of
guessing/correction for guessing, at each of 14 item
discrimination levels. Some noteworthy trends are:

a. Test length was constant at each $a_g$ level in the no
guessing (Table 1, Figure 1) and uncorrected guessing
(Table 2, Figure 2) conditions, with test length to termination
diminishing proportionately with the inverse of the $a_g$
level.
In the corrected guessing condition (Table 3 and Figure
3) test length varied across individuals, while mean test
length within $a_g$ level behaved in the same manner as did
test length in the other two conditions. One datum of note
is the behavior of test length as a function of $a_g$ level: in
order for all examinees to reach normal termination in less
than 30 items (in the corrected guessing condition), the
item discrimination value must exceed 1.25 ($a_g > 1.25$).

Another result of interest is an expected one: the
corrected guessing condition required more items to
termination than did the other conditions.



b. Errors of estimate, $e_i = (\hat\theta_i - \theta_i)$, were moderately
correlated with ability $\theta$ and test score $\hat\theta$ under all
conditions, as revealed in Tables 1, 2 and 3. $e_i$ tends to be
positive for $\theta_i < 0$ and negative for $\theta_i > 0$. This result was
consistent, and reflects a regression effect caused by the
quadratic loss function employed in the item selection
procedures.

c. Test bias, mean absolute error, test validity,
correlated errors and correlated test length values for the no
guessing, uncorrected guessing and corrected guessing
conditions are listed in Tables 1, 2 and 3, respectively.
Additionally, Figures 1, 2 and 3 graph some of these values
as a function of $a_g$ level within each condition. Noteworthy
in these data is the sizeable bias and mean absolute error in
the uncorrected guessing condition (Table 2, Figure 2), as
well as the tendency for bias and absolute error to increase
at $a_g$ levels above 2.00 in the corrected guessing condition
(Table 3, Figure 3). Note also that in the uncorrected
guessing condition (Table 2), test validity, $r_{\theta\hat\theta}$, decreased at
$a_g$ levels beyond 2.00. Jensema (1972) observed this
phenomenon, which he termed "correlation drop-off."

Study 2

Table 4 lists the observed mean values under each item
pool configuration of test score, test length, and error of
estimate for each value of $\theta$. Figures 4, 5 and 6 depict these
data graphically.

a. Test length. Mean test length (Figure 4) did not vary
with $\theta$ in the $r_{ab}$ = 0 configuration since the maximum of 30
items occurred at all levels. In the $r_{ab}$ = -.71 configuration,
mean test length covaried positively and almost perfectly
with ability level. In the $r_{ab}$ = +.71 configuration, test length
covaried inversely with trait level, with more items required
at the lower trait levels until the arbitrary 30-item limit was
reached.

b. Test scores. The regression of mean trait estimates, $\hat\theta$
on $\theta$, was virtually linear in all three configurations in the
interval [-1.5 < $\theta$ < 2.0]. As can be seen from
Figure 5, the Bayesian test scores tended to underestimate
$\theta$ at high trait levels, and to overestimate $\theta$ at low trait
levels. The regression of $\hat\theta$ on $\theta$ departed from a linear
regression at extreme levels of $\theta$ (beyond $\theta$ = ±2.00) with
the departure more noticeable in the lower extremes of the
scale.

c. Errors of estimate. The regression of mean errors of
estimate on $\theta$, seen in Figure 6, clearly illustrates a
tendency of the Bayesian test scores to overestimate $\theta$
markedly and consistently at $\theta$ < -1.5 in all three item pool
configurations. The tendency to underestimate high $\theta$'s is
also illustrated. In these data the latter tendency was quite
strong with $r_{ab}$ = -.71 but less so with $r_{ab}$ = +.71.

d. Information. The estimated values of the derivative
$\frac{\partial}{\partial\theta}E(\hat\theta \mid \theta)$, the conditional standard deviation $\sigma_{\hat\theta \mid \theta}$, and
the information at each level of $\theta$ under each item pool
configuration, are listed in Table 5. Smoothed information
curves for all three configurations are plotted in Figure 7.
Some noteworthy trends are pointed out here.
TABLE 1

Test Length, Mean Errors of Estimate, and Correlates of Ability $\theta$ and Test Score $\hat\theta$, as a
Function of Item Discrimination in the Perfect Item Pool. No Guessing Condition ($c_i = c_g = 0$).

TABLE 2

Observed Properties of the Bayesian Sequential Test as a Function of Item
Discrimination in the Perfect Item Pool. Uncorrected Guessing ($c_g = 0$; $c_i = .20$).

Figure 1. Some observed properties of a Bayesian sequential test,
as a function of item discrimination. No guessing; perfect
item pool; posterior variance termination criterion.

Figure 3. Some observed properties of a Bayesian sequential test,
as a function of item discrimination. Corrected .20
guessing; perfect item pool; posterior variance termination
criterion.

Figure 4. Mean estimated ability ($\hat\theta$) at thirty-one ability points ($\theta$)
for the simulated Bayesian sequential test under three
item pool configurations.

TABLE 5

Deviation of $\hat\theta$ and Value of the Information Function $I_{\hat\theta}(\theta)$
1) Under all three item pool configurations the information
functions were very low in the low end of the $\theta$
distribution;

2) For $r_{ab}$ = +.71 the information values uniformly
increased with increasing $\theta$;

3) For $r_{ab}$ = 0 information generally increased with $\theta$ to
about $\theta$ = 1.0, then decreased thereafter;

4) For $r_{ab}$ = -.71 information increased sharply with $\theta$ to
about $\theta$ = 0, then just as sharply decreased.

DISCUSSION

Study 1

Test length, or number of items required to satisfy the
posterior variance termination criterion, was shown to vary
inversely with item discriminatory power, $a_g$, when the
latter is constant for all items in a given test. This result was
expected, and corroborates the findings of Jensema (1972,
1974) who pointed out that if constant item discriminatory
powers were available it would be possible to
predict the validity of the trait estimates from the number
of items administered, and conversely to estimate the
number of items required to achieve any given validity
level.

In the no-guess!ng and uncorrected guessing conditions 
(that IS, m tests whjwh assume nw guessing) the test length 
vbas constant foi anjr fixed value. Has result ^^oald nut 
be hkel> to otuit i^ilh s flmtc poo! of items djt to the 
inevitab2lit> of impcrfewt B wsihitemdiffiwult^ niatJics. 
thai iS, vkith a fmitc iitm puol somo franan^e m lest length 
would hKeI> vcvui even if aO itesns had equal disvnmina 
uon parameters. Tnt favt thai there v^i^ no vanani;e in 
test iengih iwithm an> given diswnnunation ]evel> witl^i Hit 
perfect item, pool mdiwaies thai aA> vanarnx in test length 
in a rea]» constani-djswnmmaDun, no^essmg test must be 
due so!el> to inadequacies m the distnbub or» of item 
difficulty parameters m the finite item pod. 
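
The constancy of test length under a scoring model that assumes no guessing can be seen directly from the variance update. The derivation below is a sketch that assumes the standard form of Owen's (1975) posterior variance for the normal ogive model and a perfect pool in which the administered difficulty equals the current estimate, $b = \mu_m$:

```latex
% Assumptions: c_g = 0 in the scoring model and b = \mu_m (perfect pool), so D = 0.
D = \frac{b - \mu_m}{\sqrt{1/a^2 + \sigma_m^2}} = 0,
\qquad
\frac{\phi(0)}{\Phi(0)} = \frac{1/\sqrt{2\pi}}{1/2} = \sqrt{\frac{2}{\pi}}

% The posterior variance is then the same after a correct or an incorrect response:
\sigma_{m+1}^2
  = \sigma_m^2\left[1 - \frac{\phi(0)^2/\Phi(0)^2}{1 + 1/(a^2\sigma_m^2)}\right]
  = \sigma_m^2\left[1 - \frac{2}{\pi}\,\frac{a^2\sigma_m^2}{1 + a^2\sigma_m^2}\right]
```

Under these assumptions the recursion involves only the discrimination $a$ and the prior variance, so every examinee follows the same sequence of posterior variances and reaches the termination criterion after the same number of items, whatever the pattern of responses. The same argument covers the uncorrected guessing condition, because there too the scoring model sets the guessing parameter to zero.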

These results are pertinent to the use of Rasch model
ability estimation in an adaptive testing situation. Except
for the specification of the item characteristic function, the
Rasch model is conceptually identical with the no-guessing
model used in Study 1. Within each test, item discrimination
parameters were constant (as the Rasch model
assumes) and no guessing was assumed. Thus the major
difference between this portion of Study 1 and a Rasch
model simulation would be in the definition of the item
response model. We assumed a one-parameter normal ogive
response model, whereas the Rasch model uses a
one-parameter logistic one (Birnbaum, 1968, p. 402). As
Birnbaum (1968, p. 399) has pointed out, the two response
models are very similar. Thus, the results of Study 1 for the
no-guessing condition should be generalizable to adaptive
tests based on the Rasch model.

In the corrected guessing condition (Figure 3) there was
some variance in test length for all $a_g$ values (except
$a_g$ = .50, where no testees terminated in fewer than 100
items). For $a_g$ levels above .50, test length $k$ correlated
strongly and positively with the trait estimate $\hat\theta$ (Table 3).
The test length-$\hat\theta$ correlation $r_{\hat\theta k}$ equalled or exceeded .80
for all $a_g$ values above .6. The correlation $r_{\theta k}$ between test
length and ability $\theta$ was of similar magnitude but always
smaller than $r_{\hat\theta k}$. It seems obvious that for the case of
constant item discrimination and non-zero guessing there is
a systematic relationship between ability $\theta$ (or test score $\hat\theta$)
and number of items administered. Examination of the
partial correlations, however, shows that $r_{\theta k}$ vanishes when
$\hat\theta$ is statistically controlled for. For instance, for $a_g$ = 1.0 we
observed $r_{\theta k}$ = [illegible], $r_{\hat\theta k}$ = .83, and $r_{\theta\hat\theta}$ = [illegible]. Controlling for $\hat\theta$
and $\theta$, respectively, yields the partial correlations
$r_{\theta k \cdot \hat\theta}$ and $r_{\hat\theta k \cdot \theta}$ [values illegible].

Analysis of the corresponding partial correlations for the
other $a_g$ levels would yield a similar result: $r_{\theta k \cdot \hat\theta}$
approximately zero, but $r_{\hat\theta k \cdot \theta}$ positive and moderate. This suggests
that, at least for the constant item discrimination case, the
tendency for $r_{\theta k}$ to be positive is due to some characteristic
of the trait estimation method using the guessing
correction.
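
The partial correlations referred to above follow from the usual first-order formula. In the sketch below the function name is hypothetical, and the three input correlations are made-up values chosen only to illustrate the pattern described in the text; they are not the study's results.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant (first-order partial)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Hypothetical correlations among ability (theta), estimate (theta_hat)
# and test length (k).
r_theta_k = 0.78
r_hat_k = 0.83
r_theta_hat = 0.94

# Ability with test length, controlling for the estimate.
print(partial_corr(r_theta_k, r_theta_hat, r_hat_k))
# Estimate with test length, controlling for ability.
print(partial_corr(r_hat_k, r_theta_hat, r_theta_k))
```

With inputs of this kind the first partial correlation is essentially zero while the second remains positive and moderate, which is the pattern that points to the estimation method rather than to ability itself as the source of the relationship with test length.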

Another observation with regard to test length has a
practical application. Where the posterior variance termination
criterion is to be used, it is desirable that all or nearly
all examinees reach criterion (e.g., $\sigma^2_m$ < .0625 or some
other arbitrary value) within a reasonably small number of
items. Typically (e.g., Urry, Jensema), a 30-item maximum
test length has been imposed in conjunction with the
posterior variance criterion. If a large number of examinees
reach the 30-item limit before attaining the posterior
variance criterion, the latter may lose its usefulness as a
predictor of test validity. The data of Table 3 (and Figure
3) indicate that even with a "perfect" item pool, the
constant item discrimination parameter must equal or
exceed $a_g$ = 1.25 in order to insure test termination in
fewer than 30 items for the majority of examinees when
guessing is a factor. Although it is difficult to generalize this
finding to the case of typical finite item pools, it is
reasonable to expect that test termination via the posterior
variance criterion $\sigma^2_m$ < .0625 will seldom occur in fewer
than 30 items in Bayesian sequential tests using item pools
whose mean item discrimination parameter is less than
1.25.

Errors of estimate were moderately and negatively
correlated with $\theta$ in all three conditions, with the strongest
correlations observed in the uncorrected guessing situation.
That is, with constant item discrimination and a perfect
pool of item difficulties, larger errors of estimate ($\hat\theta - \theta$)
tended to occur as $\theta$ decreased. This tendency can be
viewed as a regression effect. As is typical with linear
regression estimates, for all three conditions the estimates $\hat\theta$
tended to be closer to the mean than the actual values $\theta$.
The correlation $r_{\hat\theta e}$ between trait estimates $\hat\theta$ and errors $e$
was consistently of the same sign but lower magnitude
than $r_{\theta e}$ in the no guessing and corrected guessing
conditions.

The mean error of estimate, or bias, was virtually zero in
the no guessing condition, until $a_g$ became large (Table 1;
Figure 1). For $a_g$ > 1.50 there was a tendency for positive
bias to occur. Similarly, mean absolute error was quite
constant until $a_g$ = 1.50, then became larger. In the corrected
guessing condition (Table 3, Figure 3) mean absolute error
was fairly constant across $a_g$ levels, but bias was positive at
low $a_g$ values, diminished virtually to zero at intermediate
levels, and began to increase steadily as $a_g$ increased above
2.0.

Study 2

Test length. The data illustrate clearly the effect of item
pool configuration on the correlation of test length with $\hat\theta$
(or $\theta$): the correlation is strong and its sign was opposite
that of the $r_{ab}$ correlation in the simulated item pool. (In
the $r_{ab}$ = 0 configuration there was no variance in test length,
due to the arbitrary 30-item limit. The preceding three
studies have shown, however, that with constant $a_g$, test
length varies directly with $\hat\theta$. Presumably that relationship
would hold for the $r_{ab}$ = 0 configuration if test length was
free to exceed 30 items.) We have already alluded to the
inverse relationship between test length and the rate of
reduction in the Bayes posterior variance. Thus, it should
be clear that the configuration of difficulty and discrimination
parameters in the item pool, which can be roughly
described by the correlation of the discrimination and
difficulty parameters ($r_{ab}$), effectively dictates the rate of
posterior variance reduction at any level of the trait $\theta$.
Furthermore, if a maximum test length is arbitrarily
established (such as the 30-item limit used by us, and by
Urry, 1974, and Jensema, 1972) that limit, in conjunction
with the item pool configuration, may dictate regions of
the $\theta$ continuum in which satisfactory convergence of the
trait estimates will seldom occur.

Errors of estimate. Study 1 found very high validities of
the trait estimates $\hat\theta$, indicating that the Bayesian sequential
test is capable of ordering simulated examinees from a
normal population quite well with respect to the variable,
$\theta$, underlying the item responses. Study 2 was motivated by
an interest in the accuracy of the estimates $\hat\theta_i$ rather than
the correctness of ordering, as a function of $\theta$ itself. The
data showed clearly that the Bayesian estimates behaved in
a manner similar to linear regression, except at the extremes
of the normal distribution ($\theta$ < -1.5 and $\theta$ > 2.0). Typically,
linear regression underestimates the criterion variable above
the mean, and overestimates it for values below the mean.
Such was the case for the Bayesian sequential estimates,
except that the underestimates became fairly sizeable
(around .20) on the average for $\theta$ > 2.0, and overestimates
became severe (larger than .5) in the lower levels of the
trait. Furthermore, it was shown that the behavior of the
trait estimates varies as a function of the item pool
configuration. Thus, by optimizing the item pool configuration
for a live-testing item pool it should be possible to
control the accuracy of the Bayesian test scores as
estimators of the actual trait level of the examinees. Other
alternatives may prove useful in this regard. Some of these
will be discussed below.

Information. For the configuration $r_{ab}$ = +.71, the
information of the trait estimates appears to increase linearly
with $\theta$, at least in the interval [limits illegible]. This is what
we might expect, since item discrimination increased with $\theta$
in this configuration. Note (Table 4) that mean test length
in this configuration was 30 items at the lower levels of $\theta$
[cutoff illegible], and then decreased linearly with $\theta$,
reaching a mean of 23 items at $\theta$ = 3.0.

For the $r_{ab}$ = 0 configuration the information function
appeared to take the shape of an inverted (and rather
asymmetric) shallow dish, with maximal information
attained in the interval (0 < $\theta$ < 1.5). This should approximate,
at least in its form, the information structure
resulting from applying the Bayesian sequential test with a
real item pool whose configuration is based on Urry's
(1974) prescription. It should be apparent that some
efficiency of measurement will be lost in the extremes of
the $\theta$ distribution, especially in the lower extremes. Note
that for these data, test length was a constant 30 items at
all $\theta$ levels.

For the $r_{ab}$ = -.71 configuration the information curve
does not take the shape one would assume intuitively.
From knowledge of the distribution of the discrimination
parameters it would seem that the curve should mirror that
of the $r_{ab}$ = +.71 information but with maximal information
at $\theta$ = -3.0. Instead it rather emphatically takes the convex
form. The test is maximally efficient in the interval
[-1, 0], and rapidly loses efficiency elsewhere. This is a
remarkably different result from what one would expect.
The highest item discrimination parameters were available
at the low end of the $\theta$ scale, yet information was as low
there [-2 < $\theta$ < -1.5] as it was where the lowest item
discrimination values occurred [1.5 < $\theta$ < 3.0]. The low levels
of information in the low $\theta$ region are due in part to the
small number of items administered there. As Table 4
reveals, the posterior variance termination criterion resulted
in mean test length of 14 items at $\theta$ = -3.0; 17 items at
$\theta$ = -2.0; 22 items at $\theta$ = -1.0. The information values
obtained with these test lengths could be adjusted statistically
to estimate the information values for constant 30-item test
length. Such an adjustment would still show an efficiency
loss at $\theta$ < -2.0 for this item pool configuration, despite the
high average item discrimination in that region. We will
address this problem further in the discussion to follow.

Implications. These results were obtained by simulating
a "perfect" item pool, i.e., a pool in which unlimited
numbers of items of any difficulty level were available. This
should result in data which, within the limits of sampling
error, approximate the best possible results obtainable using
the sequential testing procedure as specified by Owen
(1969), under the conditions studied.



We have found, as did Urry (1971, 1974) and Jensema
(1972, 1974) before us, that the procedure has the
potential to yield trait estimates having very high validities
with great economy of test length, provided that highly
discriminating test items, rectangularly distributed on
difficulty, constitute the item pool. We have also found that
there may be a tendency of the method to overestimate
group mean trait level when item discrimination parameters
are very high, even when the trait estimation model
exactly conforms to the item response model. When the
estimation model is not congruent with the item response
model (as in the uncorrected guessing condition of Study 1)
we have found that rather sizable bias of estimate may
occur, accompanied by diminished validity.

Lord (1970, p. 152) made the point that evaluating a tailored test by means of a group statistic (such as our validity coefficient) presumes some knowledge of the group's distribution on the trait being measured, and ignores information relevant to the accuracy of trait estimates at any one level of the trait. The validity of the Bayesian sequential test trait estimates was, as we have seen, quite high under the conditions used in our simulation studies. The accuracy of the estimates was also favorable in what corresponds to the middle ranges of a normal distribution on θ, but was found to be less favorable in the extremes, especially the lower extreme. Similarly, the information functions of the trait estimates showed that the effectiveness of measurement under the Bayesian tailoring procedure varied systematically as a function of the configuration of the item parameters constituting the item pool, but in all three configurations measurement effectiveness was very low in the low ranges of the trait.

The observed loss of accuracy and information in the extremes of the "typical" range of θ is disturbing, since the advantage of tailored testing over conventional testing is the former's supposed potential for superior measurement accuracy and effectiveness in those extremes. From our data it is apparent that with the exception of the r_ab = +.71 configuration, the sequential test scores are behaving much like conventional test scores, at least in terms of the shapes of their information functions. And even for the r_ab = +.71 configuration measurement effectiveness was relatively poor in the lower extremes of θ. The utility of the Bayesian adaptive testing strategy may be diminished considerably by results like those reported for Study 2, if they prove to be general.

The problems revealed in Study 2 (of bias, non-linear in θ, and of convex information structures of the trait estimates) have causes which may be amenable to improvement. At the heart of the problem is the effect of guessing, which generally operates to reduce measurement efficiency at all trait levels, and especially at low trait levels. Also at the core of the problem is the Bayesian procedure itself. As we have pointed out earlier, the Bayesian trait estimates behave like regression estimates. Extreme values of θ are systematically regressed toward the initial prior estimate; the assumption of a normal prior distribution of θ ensures this tendency. Now, the more extreme θ is for any individual, the larger will be the regression effect on the estimate. Recall that the item selection procedure selects an item with difficulty somewhat easier than the current θ estimate. But for high θ the current estimate is almost always too low. Thus the selected item will almost always be too easy for extremely able examinees. Cumulated over, say, 30 items, the effects of this inappropriate item selection will be several:

1) mean proportion correct will tend to increase as a function of θ, despite the explicit attempt of the tailoring procedure to make it constant at all levels of θ;

2) θ will tend to be underestimated for high θ due to the inappropriate difficulty of the test items administered;

3) information loss will occur at high θ due to the shallowing slope of the regression of θ̂ on θ.
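The regression effect described above can be illustrated with a simplified normal-normal Bayesian update. This is a sketch only: it is not Owen's item-by-item procedure, and the function name, the standard normal prior, and the error variance of 0.25 are assumptions chosen for illustration.

```python
def posterior_mean(prior_mean, prior_var, observed, error_var):
    """Posterior mean for a normal prior and a normally distributed
    observation with known error variance (conjugate update)."""
    weight = prior_var / (prior_var + error_var)  # always below 1
    return prior_mean + weight * (observed - prior_mean)


# Hypothetical values: standard normal prior, error variance 0.25.
for observed in (-2.0, 0.0, 2.0):
    estimate = posterior_mean(0.0, 1.0, observed, 0.25)
    print(f"observed {observed:+.1f} -> estimate {estimate:+.2f}")
```

Because the weight is below 1, every estimate is pulled toward the prior mean, so under these assumptions high trait levels are underestimated and low trait levels overestimated, and the pull grows with the distance from the prior mean.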

For low θ the initial prior is an overestimate. Hence, the first item selected will generally be too difficult [(b_g - θ) > 0], yet the examinee has a non-zero chance of answering it correctly. A correct answer, of course, will cause an increase of θ̂ and thus result in another inappropriate choice of item difficulty. Furthermore, as Samejima (1973) has shown, there may actually be negative information in a correct response to an item whose difficulty b_g exceeds an examinee's actual trait level by a fairly small increment, when guessing is a factor. We suggest that examinees in the low extremes of θ are rather consistently being administered overly difficult items [(b_g - θ) > 0] with several systematic results:

1) mean proportion correct tends to decrease with θ despite the tailoring process;

2) posterior variance reduction tends to be more rapid for individuals of low trait levels, due largely to their sub-optimal proportion of correct responses, resulting in shorter mean test length;

3) the shorter the test length, the less opportunity the Bayesian estimation procedure has to converge to extreme trait level estimates;

4) non-convergence combines with negative information in some correct responses to diminish severely the effectiveness of measurement in the low regions of the trait.
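The claim that guessing reduces measurement efficiency at all trait levels, and most severely at low ones, can be checked numerically. The sketch below is illustrative only: it assumes the three-parameter logistic approximation (scaling constant D = 1.7) rather than the normal ogive model used in the simulations, and the item parameters are hypothetical.

```python
from math import exp


def item_information(theta, a, b, c, D=1.7):
    """Item information under the three-parameter logistic model:
    I = (D*a)^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p_star = 1.0 / (1.0 + exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * p_star  # probability of a correct response
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


# Hypothetical item: a = 1, b = 0, guessing parameter c = .25 versus c = 0.
for theta in (-2.0, 0.0, 2.0):
    retained = item_information(theta, 1.0, 0.0, 0.25) / item_information(theta, 1.0, 0.0, 0.0)
    print(f"theta {theta:+.1f}: fraction of information retained = {retained:.3f}")
```

Under these assumptions the item keeps a much smaller fraction of its no-guessing information for an examinee two units below its difficulty than for one two units above it.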

Some of the conclusions just stated are speculative. Specifically, we have not looked at proportion correct as a function of θ, nor at the quantity (b_g - θ), which bear on the appropriateness of the tailoring process. Future simulation studies will be necessary to examine these variables.

One goal of adaptive testing should be to achieve a constant high level of measurement effectiveness at all levels of θ. This desideratum is equivalent to a high, horizontal information function. We have found that the Bayesian sequential test failed to achieve this goal despite an unrealistically favorable set of circumstances: the perfect item pool, error-free item parameters, and a scoring model perfectly congruent with the item response model. We have attributed the shortcomings of the Bayesian trait estimates to the regression-like tendency of the sequential estimates and to inappropriate item selection for individuals whose trait levels are high or low.

There are at least two methods of addressing this problem, both of which would, to some extent, lessen the bias of estimate at the extremes and improve the information structure of the trait estimates. The first method involves the assumption of a rectangular rather than a normal prior distribution of θ. The second method would involve replacing the present item selection procedure with a mechanical branching procedure which would be less sensitive to large errors in the current trait estimate in the choice of the next item to administer. Needless to say, both of these alternatives do considerable violence to Owen's elegant procedure.

If the practitioner is committed to the procedure as it was originally proposed, it would seem that the best course of action would be to take great care in assembling the item pool, and to administer a constant number of items (say 30) to each examinee. If no strong commitment to Owen's procedure is involved, the practitioner may be well advised to use another adaptive strategy, such as Weiss' stradaptive test (Weiss, 1974), Lord's (1974) maximum likelihood procedure, or a similar procedure being investigated by Samejima (1975). Systematic investigation of some of these strategies, which will permit them to be compared with the Bayesian sequential test, is currently in progress.

AN EMPIRICAL INVESTIGATION OF WEISS'
STRADAPTIVE TESTING MODEL



BRIAN K. WATERS
U.S. Air Force Human Resources Laboratory



This study¹ investigated the validity and utility of the stratified adaptive ("stradaptive") computerized testing model proposed by Weiss and colleagues in the Psychometric Methods Program, University of Minnesota. Weiss and his associates have reported the theoretical development of the stradaptive model (Weiss, 1973; DeWitt and Weiss, 1974; McBride and Weiss, 1974) including some examples of individual results. To date, no full empirical studies of the model have been published.

The Stradaptive Testing Model

Lord's theoretical analysis of adaptive testing versus conventional testing makes one point very clear: a peaked test provides more precise measurement than an adaptive test of the same length when the testee's ability is at the point at which the conventional test is peaked. At some point on the ability continuum, generally beyond ±.5 standard deviations from the mean, the adaptive test requires fewer items for comparable measurement efficiency.

Lord suggests that an "ideal" testing strategy would present a sample of items to each subject comprising a peaked test with a .50 probability of a correct answer for examinees of the particular subject's true ability (P = .50). The catch, of course, is that the true ability of the subject is unknown, the estimation of which is, in fact, the desired outcome of the measurement procedure. Traditionally, this problem has been circumvented by peaking the test at P = .50 for the hypothetical average ability level subject. This procedure worked well for examinees near the center of the ability continuum, but less efficiently near the extremes.

Weiss' stradaptive model extends the Binet rationale to computer-based ability measurement. A large item pool is necessary, with item parameter estimates based upon a large sample of subjects from the same population as potential examinees. Items are scaled into peaked levels (strata) according to item difficulty. A subject's initial item is based upon a previously obtained ability estimate or the subject's own estimation of his ability on the dimension being assessed.

Figure 1 depicts a nine-strata distribution of items in a hypothetical stradaptive item pool.

As in the Binet, the subject's basal and ceiling strata are defined, with testing ceasing when the ceiling stratum has been determined. A subject's score is a function of the difficulty of the items answered correctly, utilizing various scoring strategies (Weiss, 1973).

The Item Bank

Verbal analogy test items were used in this study, selected from the SCAT Series II.² This test series provided a single-format, unidimensional test with extensively normed item parameter estimates. The item format was easily stored in a computer item file, being short and standard for all 244 items.

Item pool data received from Educational Testing Service contained five 50-item verbal analogy tests: Forms 1A, 1B, 1C, 2A and 2B of the SCAT Series II examinations. These tests had been nationally normed on a sample of 3133 twelfth grade students in October 1966. P-values and biserial correlations on 249 items were provided by ETS. These values were transformed into normal ogive item parameters.
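The transformation itself is not spelled out in the text. A minimal sketch, assuming the conventional approximations a = r_bis / sqrt(1 - r_bis²) and b = -Φ⁻¹(p) / r_bis, where p is the proportion passing and Φ⁻¹ is the inverse standard normal distribution function, and assuming ability is normally distributed in the norming sample. The function name and the example values are hypothetical and are not taken from the SCAT data.

```python
from math import sqrt
from statistics import NormalDist


def normal_ogive_params(p_value, r_biserial):
    """Approximate normal-ogive discrimination (a) and difficulty (b)
    from a classical p-value and an item-test biserial correlation."""
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    if not 0.0 < r_biserial < 1.0:
        raise ValueError("r_biserial must lie strictly between 0 and 1")
    a = r_biserial / sqrt(1.0 - r_biserial ** 2)
    # gamma: normal deviate above which a proportion p_value of examinees fall
    gamma = -NormalDist().inv_cdf(p_value)
    b = gamma / r_biserial
    return a, b


# Hypothetical item passed by half the norming sample, biserial .60.
a, b = normal_ogive_params(0.50, 0.60)
print(f"a = {a:.2f}, b = {b:.2f}")
```

Easier items (higher p) receive lower b values under this approximation, and items with low biserials are pushed toward extreme difficulties, which is one reason the approximation is usually applied only to items with adequate discrimination.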

Table 1 shows the actual distribution of items used in this experiment. The final pool included 244 items grouped into 9 strata according to normal ogive item difficulty parameters as shown in Table 1.

The nine strata in Table 1 are essentially nine peaked tests, varying in average difficulty from -2.12 to +1.91. Stratum 9, the most difficult peaked test, for example, was composed of 19 items ranging from b = 1.27 to b = 3.68. In this study, items were randomly ordered within strata, unlike in Weiss' model, in order to permit an alternate-forms reliability coefficient to be calculated for stradaptive examinees. As is typical in educational and psychological research, the concentration of more difficult items contains the lower discrimination values. A correlation between b and a of -.31 reflects this problem.

Subject Pool. One hundred and two incoming freshmen to Florida State University were tested in late July 1974. Ninety-nine of the subjects had Florida Twelfth Grade (12V) Verbal Scores or 12V estimates derived from ACT or CEEB verbal scores to serve as criterion for the validity investigation of the stradaptive test scores.

¹ This paper is based on the author's doctoral dissertation conducted at Florida State University under the direction of Dr. Howard W. Stoker. Requests for copies of the dissertation should be sent to the author c/o AFHRL/FT, Williams AFB, AZ 85224.

² Test materials from SCAT Series II Verbal Ability tests were adapted and used with the permission of Educational Testing Service. The author of this paper gratefully acknowledges the help of ETS in the pursuit of this research.

Table 2 depicts linear and stradaptive group summary statistics on the 12V scores.

As can be seen in Table 2, the random assignment of subjects to linear or stradaptive testing groups did a good job in equating the groups on the ability continuum as measured by the Florida 12th Grade Verbal test.

Since SCAT-V published results had shown significantly different difficulty levels between the two forms, linear subtest scores were normalized within their separate distributions and then pooled into a linear total score distribution for comparison with stradaptive results.

CRT Testing

A computer program described by DeWitt and Weiss (1973) was adapted to fit the FSU Control Data Corporation 6500 computer.

Figure 1. Distribution of items, by difficulty level, in a Stradaptive Test

TABLE 1

Item Difficulties (b) and Discriminations (a), Based on Normal Ogive
Parameter Estimates, for the Stradaptive Test Item Pool
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D)ls tt9 w»>isml9ne7 to strtvn 6 nis^r tS^ $. rofttfojtcT/Too ti4>i«ct» r«tcfetd l.*»e Jtec U vti Strxdept'.ve Pool. 



TABLE 2

Comparison of Distributions of Linear and
Stradaptive Group Florida 12th Grade Verbal Scores

GROUP          # SUBJECTS    MEAN     STD DEV    STD ERR    KURTOSIS    SKEWNESS
LINEAR             46        33.26     5.30       .855        .44         .70
STRADAPTIVE        53        34.06     6.12       .842        .36        -.03


P(μ lin = μ str) = > [illegible]

Testing Sequence. The subjects estimated their ability using the procedures described in DeWitt and Weiss. The first item that the stradaptive subject received was the first item in the stratum commensurate with his ability estimate. The subject was then branched to the first item in the next higher or lower stratum depending upon whether the initial response was correct or incorrect. If the subject entered a question mark, the next item in the same stratum was presented.
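The branching rule just described can be sketched in a few lines. This is an illustration, not the DeWitt and Weiss program: the function name, the response codes ("+", "-", "?"), and the choice to stay in the extreme stratum when no higher or lower stratum exists are assumptions.

```python
def next_stratum(current, response, n_strata=9):
    """Return the stratum from which the next item is drawn.

    response: '+' correct, '-' incorrect, '?' omitted.
    """
    if response == "+":
        return min(current + 1, n_strata)  # branch up; stay put at the top (assumed)
    if response == "-":
        return max(current - 1, 1)         # branch down; stay put at the bottom (assumed)
    if response == "?":
        return current                     # omit: another item from the same stratum
    raise ValueError(f"unknown response code: {response!r}")


# Hypothetical examinee entering at stratum 5.
path = [5]
for response in "+-+?-":
    path.append(next_stratum(path[-1], response))
print(path)
```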

Testing continued until a subject's ceiling stratum was identified. For this study, the ceiling stratum was defined as the lowest stratum in which 25% or less of the items attempted were answered correctly, with a constraint that at least 5 items be taken in the ceiling stratum. The 25% figure reflects the probability of getting an item right by random guessing on a 4-option multiple choice test. Once a subject's ceiling stratum was defined, the program looped back to the examinee's ability estimate stratum and commenced a second stradaptive test with item selection continuing down the item matrix from where the first test ended. Since items were randomly positioned within each stratum, parallel, alternate forms were taken by all subjects who reached termination criterion on the first test.

A maximum of 120 items per subject was established, as pre-study trial testing suggested that subjects became saturated beyond this point.

Termination Rules. Weiss had two versions of his stradaptive testing computer program. Version one, which was used in this study, presented another item in the same stratum when a subject skipped an item.

The author of this study was unaware of the existence of the second branching strategy program prior to completion of data collection. However, Weiss' program procedure of ignoring skipped items in determining test termination was questioned. It appeared that valuable information was being lost when the Weiss procedure was followed.

It was reasonable to expect that a subject would omit an item only when he felt he had no real knowledge of the correct answer. Thus, investigation of test termination based upon omits counted as wrong answers was judged appropriate.

Weiss had set 5 items in the ceiling stratum as the minimum constraint upon termination. A secondary goal of the present study was to determine what effect the reduction of this constraint to 4 would have upon the effectiveness of the stradaptive strategy.

These two questions of the handling of omits and the variation in the constraint on the termination of testing created the following three methods for comparison:

Termination Method 1: Omits ignored / constraint = 5 items
Termination Method 2: Omits = wrong / constraint = 5 items
Termination Method 3: Omits = wrong / constraint = 4 items
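The three termination methods can be compared on a single response record. The sketch below is a hypothetical reconstruction, not the program used in the study: the function name, the response codes, and the example record are assumptions, and the ceiling rule is implemented as "lowest stratum with at least `min_items` counted items of which 25% or less are correct".

```python
def termination_index(record, omits_wrong, min_items, chance=0.25):
    """Return (item number, ceiling stratum) at which testing would stop,
    or None if the ceiling rule is never met.

    record: list of (stratum, response) in order of administration,
            with response '+' correct, '-' incorrect, '?' omitted.
    """
    attempted, correct = {}, {}
    for n, (stratum, response) in enumerate(record, start=1):
        if response == "?" and not omits_wrong:
            continue  # omit ignored: it does not count as an attempted item
        attempted[stratum] = attempted.get(stratum, 0) + 1
        if response == "+":
            correct[stratum] = correct.get(stratum, 0) + 1
        for s in sorted(attempted):  # lowest qualifying stratum wins
            if attempted[s] >= min_items and correct.get(s, 0) / attempted[s] <= chance:
                return n, s
    return None


# Hypothetical record: entry at stratum 5, one omit at item 4.
record = [(5, "+"), (6, "-"), (5, "+"), (6, "?"), (6, "-"), (5, "+"), (6, "-"),
          (5, "+"), (6, "-"), (5, "+"), (6, "-"), (5, "+"), (6, "-")]

method_1 = termination_index(record, omits_wrong=False, min_items=5)
method_2 = termination_index(record, omits_wrong=True, min_items=5)
method_3 = termination_index(record, omits_wrong=True, min_items=4)
print(method_1, method_2, method_3)
```

On this record the test ends progressively earlier under Methods 1, 2 and 3, because the omit counts toward the ceiling stratum only under Methods 2 and 3 and the constraint is relaxed under Method 3.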



Data was collected using Termination Method 1, then rescored using Methods 2 and 3. This was possible since no indication of the termination of the first test was given to the subject and since items were randomly ordered within strata. Once test termination was reached using Termination Method 2 or 3, the next item taken by the subject in his entry point stratum acted as the start of a parallel forms test under the termination rule used.

Of course, Method 2 required fewer items than Method 1 and Method 3 considerably fewer than Method 2. The thrust of this investigation, then, was to determine the relative efficiency of the three methods in comparison with one another and with linear testing after equalizing test length using the Spearman-Brown prophecy formula.
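The Spearman-Brown prophecy formula projects a reliability r to a test k times as long: r_k = k r / (1 + (k - 1) r). A short sketch follows; the function name is an assumption, and the input values are the raw KR-20 coefficients and mean test lengths reported in Table 6 for termination rules 1 and 3.

```python
def spearman_brown(reliability, length_ratio):
    """Projected reliability of a test lengthened by `length_ratio`."""
    return length_ratio * reliability / (1.0 + (length_ratio - 1.0) * reliability)


# Raw KR-20 values and mean test lengths from Table 6, stepped up to 50 items.
rule_1 = spearman_brown(0.901, 50 / 31.45)
rule_3 = spearman_brown(0.873, 50 / 19.2)
print(f"rule 1: {rule_1:.3f}, rule 3: {rule_3:.3f}")
```

Both projections agree with the stepped-up KR-20 values of .935 and .947 given in Table 6.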

Stradaptive Test Output. Figure 2 provides an example of a stradaptive test report from this experiment. A "+" next to an item indicates a correct response, a "-" an incorrect response, and "?" shows that the subject omitted the item.

The examinee in Figure 2 estimated her ability as "5." Hence, her first item was the first item in the 5th stratum. She correctly answered this question but missed her second item, and after responding somewhat inconsistently for the first nine items, "settled down" with a very constant pattern for items 10 through 19, when she reached stopping rule criterion and her first test terminated.

The testing algorithm then selected the 6th item in Stratum 5 (her ability estimate) to commence her second test. (The subject was totally unaware of this occurrence as no noticeable time delay occurred between her 19th and 20th items.)

At the conclusion of her 31st item, this subject reached termination criterion for her second test, was thanked for her help in this research project, and given her score of 15 correct answers out of 31 questions with a percentage correct of 48.4%.

The scores for this subject are shown for both tests. The interested reader may gain a more thorough understanding of the scoring methods used in this model by tracing this subject's ability estimate scores through Table 1.



RESULTS AND DISCUSSION 



Test theory suggests that measurement efficiency is maximized at P = .50 for a given test group. It was hypothesized that the stradaptive test strategy would more nearly approach this standard than the conventional linear test, indicating an improved selection of items for the stradaptive subject. Table 3 shows the result of this comparison. It clearly indicates significantly different distributions of test difficulty. The stradaptive test was far more difficult than the linear test, with a smaller variance.

[Figure 2 is a computer-printed report for one examinee, date tested 74/07/29. For each of the two stradaptive tests it shows the item-by-stratum response record (stratum 1 = easy through stratum 9 = difficult), the proportion correct (.474 on Test 1, .500 on Test 2), and the examinee's value on each of the following ten scores. The response record and most score values are not legible in the source.]

 1. Difficulty of most difficult item correct
 2. Difficulty of the N+1 th item
 3. Difficulty of highest non-chance item correct
 4. Difficulty of highest stratum with a correct answer
 5. Difficulty of the N+1 th stratum
 6. Difficulty of highest non-chance stratum
 7. Interpolated stratum difficulty
 8. Mean difficulty of all correct items
 9. Mean difficulty of correct items between ceiling and basal strata
10. Mean difficulty of items correct at highest non-chance stratum

Figure 2. Example of stradaptive testing report



TABLE 3

Comparison of Difficulty Distributions (P)
for Linear and Stradaptive Groups

GROUP          # SUBJECTS    MEAN     STD DEV    STD ERR    KURTOSIS       SKEWNESS
LINEAR             47        .752      .123       .018        -.87           -.39
STRADAPTIVE        55        .584      .084       .011      [illegible]    [illegible]


*P(μ str = μ lin) = <.0001

**P(σ² str = σ² lin) = <.05

Linear Test Reliability. Making the standard assumptions underlying the one-way random effects analysis of variance (ANOVA), the estimated reliability coefficient of the total scores is shown in Table 4 for the linear examinees.

The internal consistency reliability estimate for the linear test was .776 for a test of an average of 48.4 items in length. Stepped-up to 50 items via the Spearman-Brown Prophecy formula, this estimate becomes .782. The reported reliability of the original SCAT-V tests was [illegible]. Using Feldt's (1965) test, the two reliabilities differed significantly (p = <.05).

It can be assumed that the difference between these reliabilities was caused by one or more of three factors:

1. Testing mode (CRT vs paper and pencil).

2. Elimination of 6 of the 250 items from the original item pool.

3. Restriction of range in subject pool for this experiment.

The latter factor most likely caused the decrease in the reliability of the test scores. The homogeneity of the subjects would yield a relatively small amount of between person variance, which would lower the reliability estimate. It might also be mentioned that Stanley noted that intraclass item correlation is a lower bound to the reliability of the average item.



Stradaptive Total Test Reliability. Using Stanley's (1971) procedure, it was possible to estimate the internal consistency reliability of the person by item stradaptive test matrix. Of the 244 items in the stradaptive pool, only 133 items were actually presented to the subject pool in this experiment.

Weiss' Scoring Method 8 provided the only set of stradaptive test scores wherein a person's total test score was a linear function of his item scores. Hence, this scoring method was used to estimate internal consistency reliability. Table 5 summarizes these results.
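The reliability coefficients reported with Tables 4 and 5 can be reproduced from the analysis of variance entries as 1 - MS(error) / MS(persons). The sketch below assumes that this is the computation used (it matches every coefficient reported); the function name is hypothetical, and the sums of squares and degrees of freedom are those printed in the two tables.

```python
def anova_reliability(ss_persons, df_persons, ss_error, df_error):
    """Reliability estimated as 1 - MS(error) / MS(persons)."""
    ms_persons = ss_persons / df_persons
    ms_error = ss_error / df_error
    return 1.0 - ms_error / ms_persons


# (sum of squares persons, df persons, sum of squares error, df error)
tables = {
    "linear (Table 4)":        (37.57, 46, 408.55, 2229),
    "stradaptive, rule 1":     (191.941, 54, 588.253, 1675),
    "stradaptive, rule 2":     (178.870, 54, 470.442, 1401),
    "stradaptive, rule 3":     (155.841, 54, 366.447, 1001),
}
for label, entries in tables.items():
    print(f"{label}: {anova_reliability(*entries):.3f}")
```

The four results round to .776, .901, .899 and .873, the values given in the text and tables before any Spearman-Brown adjustment.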

Table 6 shows the parallel forms and KR-20 reliability estimates for the three termination rules used in this study. Direct comparisons can be made between the stradaptive KR-20 values and the .782 linear KR-20 estimate. According to Feldt's (1965) approximation of the distribution of KR-20, all of the estimates of the stradaptive test reliability are significantly (p = <.05) better than the linear KR-20 estimate prior to being stepped-up by the Spearman-Brown formula: P(.675 < ρ_20 < .858) = .95. Thus, the 19, 26, and 31 item stradaptive tests all proved more reliable than the 48 item linear test.

A comparison of the linear internal consistency reliability coefficients (KR-20) and the stradaptive parallel-forms reliability estimates in Table 6 must be considered only tentatively since they are different kinds of estimates of the true reliability. The sampling distribution of the parallel-forms correlation is known and that of KR-20 has been approximated by Feldt (1965). Cleary & Linn (1969) compared standard errors of both indices with generated data of known ρ. They found the standard error of KR-20 to be somewhat smaller than that of the parallel test correlation (approximately .05 vs .04 in the range of reliabilities, number of subjects, and number of items involved in this experiment).

TABLE 4

Analysis of Variance for Linear Test Person by Item Matrix

SOURCE       df     SUM OF SQUARES    MEAN SQUARES
Persons       46         37.57            .817
Error       2229        408.55            .183
Total       2275        446.12

TABLE 5

Analysis of Variance of Scoring Method 8
of Stradaptive Test Person-By-Item Matrix

TERMINATION
RULE         SOURCE       df     SUM OF SQUARES    MEAN SQUARES
   1         Persons       54       191.941           3.555
             Error       1675       588.253            .351
             Total       1729                       (r_20 = .901)

   2         Persons       54       178.870           3.312
             Error       1401       470.442            .336
             Total       1455                       (r_20 = .899)

   3         Persons       54       155.841           2.886
             Error       1001       366.447            .366
             Total       1055                       (r_20 = .873)

Linear Test Validity. The correlation of obtained linear scores with the Florida 12th Grade Scores was .477, which was significantly lower than the published SCAT-V/SAT-V correlation of .83 (p = <.01). As with the linear reliability, this difference most likely resulted from subject homogeneity.

Stradaptive Test Validity. The validity coefficients of the stradaptive scoring under the three termination rules are shown in Table 7. Validity was estimated by the correlation between the test scores and 12V scores. None of the validity coefficients in Table 7 were significantly different from the linear validity coefficient of .477, although stradaptive validity coefficients were consistently higher than the linear indices.



Number of Items. Table 8 shows the difference in number of items presented for the linear and the three termination methods of the stradaptive test. The average number of items presented per subject was surprisingly constant over the two parallel tests of termination methods 1 and 3. Method 2 did show a significant (p = <.05) drop in the average number of items on the second test, possibly due to the 60-item limit.

Item Latency. It was hypothesized that mean item latency would be higher for stradaptive subjects since they would have to "think" about each item as it was near the limit of their ability. Table 9 reflects the results of this comparison.

The hypothesis of no difference between item latencies was rejected. For the subjects in this experiment, the average stradaptive item required approximately 11% longer than the average linear item.
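The 11% figure follows directly from the means in Table 9. The sketch below also computes a rough large-sample statistic for the difference in means; it treats every item response as an independent observation, which is an assumption made here for illustration and not necessarily the test used in the study. The function name is hypothetical.

```python
from math import sqrt


def latency_comparison(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (ratio of mean latencies b/a, unequal-variance z statistic)."""
    ratio = mean_b / mean_a
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_b - mean_a) / standard_error
    return ratio, z


# Table 9: linear (a) versus stradaptive (b) seconds per item.
ratio, z = latency_comparison(35.999, 12.062, 2276, 40.047, 13.219, 1730)
print(f"stradaptive / linear latency = {ratio:.3f}, z = {z:.1f}")
```

A ratio of about 1.11 corresponds to the reported 11% increase, and under the independence assumption the difference is many standard errors from zero, consistent with the reported significance level.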

Testing Costs. No full cost analysis was planned for this study. However, computer costs were available for the three-day data collection. A total of $89.00 was spent over the entire period on the CDC 6500 computer. This total included core memory (CM), central processor (CP), permanent file storage (MS), data transmittal between the CRTs and the computer, line printing (LP), and punch card output for 102 subjects. Data files were punched out as they were created to assure that data would not be lost in case of hardware malfunction.

TABLE 6

Comparison of Scoring Method 8 Parallel Form Reliability
with KR-20 Reliability Over Three Termination Rules Stepped Up to 50 Items

                                 TERMINATION RULES
                              1           2           3
Parallel Forms             (N = 12)    (N = 28)    (N = 38)
  r (raw)                    .892        .688        .732
  r (50 items)               .929        .806        .903
KR-20                      (N = 55)    (N = 55)    (N = 55)
  r (raw)                    .901        .899        .873
  r (50 items)               .935        .943        .947
  k (avg. # items)          31.45       26.47       19.2

k = average number of items under the termination rule

TABLE 7

Comparison of Validity Coefficients of Scoring
Method 8 under Three Termination Rules

Termination Rule         1             2          3
N                       64            80         91
r_cx                [illegible]      .536       .499
r_cx*               [illegible]      .693       .626

r_cx = Correlation between criterion measure (12V) and test score
r_cx* = r_cx corrected for attenuation

In the present study, 6 CRTs were kept on and tied to the computer continuously for 14 hours a day for 3 days in order to be ready for subject-volunteers whenever they arrived. In any institutional implementation of computer testing outside the experimental situation, exam time would be scheduled, thus minimizing telephone line transmittal costs.

The cost of actually testing each individual came to less than 2¢ per subject for CM, CP, MS and LP time. The vast majority of the costs cited above involve 42 hours of continual tie-in to the computer, the "unnecessary" punching out of all data, and the extensive file manipulations done by the author because direct access space became critically short during data collection. The latter factor required restorage of data files from direct to indirect file space.

This cost approximation could be compared with testing costs from the reader's experience. Without trying to define conventional testing costs per se, there is still little doubt that computer based testing costs less than conventional testing with the paper and pencil mode for any large-scale testing program.



CONCLUSIONS AND IMPLICATIONS FOR
FUTURE RESEARCH



The results of this study favor further investigation of the stradaptive testing model. The model produced consistently higher validity coefficients than conventional testing with a significant reduction in the number of items from 48 to 31, 25 and 19 for the three stradaptive termination rules investigated in the study. The internal consistency reliability for the best stradaptive scoring methods was significantly higher than the conventional KR-20 estimate, and the stradaptive parallel-forms reliability estimates were consistently higher than conventional KR-20 estimates.

No prior research was found showing a comparison of item latency data between adaptive and conventional testing modes. Results in this study clearly indicate that subjects take significantly longer to answer items adapted to their ability level, about 11% longer in the present study. This is an important result, as it indicates that future research into adaptive testing of any kind should take this variable into consideration when evaluating an adaptive test strategy. The net gain of the adaptive model is a function of the testing time needed to adequately measure a subject's ability, not the number of items presented to the subject. All prior research reviewed tacitly assumed that item latency was consistent across testing strategies. This study indicated this assumption to be false.

TABLE 8

Comparison of Average Number of Items for Linear Test and Three Termination
Methods of Alternate-form Stradaptive Tests

                           TEST 1                              TEST 2
              # SUBJECTS   AVG #     STD DEV      # SUBJECTS   AVG #     STD DEV
                           ITEMS     # ITEMS                   ITEMS     # ITEMS
LINEAR            47       48.43       .99
STRADAPTIVE
  Method 1        55       31.46      18.03           38       30.92      12.54
  Method 2        55       26.94      16.76           41       21.98      13.10
  Method 3        55       19.20      14.06           47       18.19      11.34

TABLE 9

Comparison of Distributions of Item
Latency Between Linear and Stradaptive Groups

GROUP          # ITEMS    MEAN # SEC/ITEM    STD DEV
LINEAR           2276         35.999          12.062
STRADAPTIVE      1730         40.047          13.219

P(μ str = μ lin) = <.001
P(σ² str = σ² lin) = <.001
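The distinction between items saved and time saved can be made concrete with the means in Tables 8 and 9. This is a rough sketch: multiplying mean test length by mean item latency only approximates mean testing time (it ignores any correlation between the two), the stradaptive latency was observed under Termination Method 1, and the function name is hypothetical.

```python
def approx_testing_time(mean_items, mean_seconds_per_item):
    """Approximate mean testing time in seconds (product of the two means)."""
    return mean_items * mean_seconds_per_item


linear_time = approx_testing_time(48.43, 35.999)       # linear test
stradaptive_time = approx_testing_time(31.46, 40.047)  # stradaptive, Method 1, first test

item_saving = 1.0 - 31.46 / 48.43
time_saving = 1.0 - stradaptive_time / linear_time
print(f"items saved: {item_saving:.1%}, testing time saved: {time_saving:.1%}")
```

Under these assumptions a saving of roughly 35% in items shrinks to roughly 28% in testing time, which is the sense in which item counts alone overstate the gain.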

It is recommended that future stradaptive experimental studies should consider both stradaptive branching models with a comparison of results from variation in the minimum number of items in the ceiling stratum. A comparison between variable number of stage strategies and fixed number of stage strategies is desirable.

As suggested in previous research, adaptive testing may reach "peak" efficiency at between 15 and 20 items. A comparison of stradaptive test statistics, for example with k = 15, 20 and 25 items, with linear testing should investigate this hypothesis. Once the stradaptive data is collected under the variable strategy, the fixed item statistics can be determined by grading the stradaptive test after "K" items and then "starting" the subject's second test at the first item of the entry point level.

Following the same logic which led to termination of a
subject's testing when five items in a row in the highest
stratum had been correctly answered, the missing of five
items in a row of any stratum should provide immediate
ceiling stratum definition. The probability of this occurrence
would be less than .05 for a properly normed item
pool. In the case of the present study, 13 of the 55
stradaptive subjects would have terminated a stradaptive
test an average of 12.1 items earlier than termination
method 1, with no effect upon the other 42 subjects. The
resulting stradaptive test statistics obtained from the
implementation of this suggestion have not been calculated,
except that the change would have reduced the average
number of items presented under termination method 1 to
28.4 from 31.45 (9.7%).
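
The proposed rule can be sketched as follows. The representation of a test record as (stratum, correct) pairs, the function name, and the reading of "in a row" as consecutive items administered within the same stratum are assumptions of this sketch, not details given in the study.

```python
from typing import Dict, List, Optional, Tuple

Response = Tuple[int, bool]  # (stratum of the item, answered correctly?)


def termination_index(responses: List[Response],
                      highest_stratum: int,
                      run_length: int = 5) -> Optional[int]:
    """Return the number of items given when testing would stop, else None.

    Testing stops when `run_length` consecutive items from the highest
    stratum are answered correctly, or when `run_length` consecutive
    items from any single stratum are missed.
    """
    # For each stratum: (outcome of its latest item, length of current run).
    streaks: Dict[int, Tuple[bool, int]] = {}
    for count, (stratum, correct) in enumerate(responses, start=1):
        last_outcome, run = streaks.get(stratum, (correct, 0))
        run = run + 1 if last_outcome == correct else 1
        streaks[stratum] = (correct, run)
        if run >= run_length:
            if not correct:                 # five misses in one stratum
                return count
            if stratum == highest_stratum:  # five hits in the top stratum
                return count
    return None


if __name__ == "__main__":
    # Hypothetical record: the subject alternates between strata 2 and 3,
    # passing every stratum-2 item and missing every stratum-3 item.
    record = [(2, True), (3, False)] * 5
    print(termination_index(record, highest_stratum=9))
```

For the hypothetical record shown, the fifth consecutive miss in stratum 3 occurs on the tenth item, so the function reports termination after ten items.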

Further research is recommended into adaptive testing in
which both the number of stages and step size are variable.
The Bayesian strategies and Urry's model (1970) are
examples of this category of adaptive measurement and
further model development seems appropriate.

Research is indicated with comparisons between adaptive
models as well as the traditional design of comparing
adaptive methods with conventional methods. Weiss' ongoing
research is beginning this work, but more is needed.
The traditional comparison assumes that conventional test
statistics are the criterion that an adaptive testing procedure
should try to duplicate. Lord, Green, Weiss and others have
argued that improved measurement of the individual at all
ability levels may be hidden by the use of classical test
statistics such as validity and even reliability.

One objective of this study was the attempt to estimate
the degree to which the violation of the assumptions of the
one-factor ANOVA model affected KR-20 reliability estimates.
The assumption that items are independent of one
another is clearly violated in any adaptive testing procedure.
The extent of the effect this violation causes is
unknown, yet most previous research in adaptive testing has
only considered ANOVA KR-20 estimates.

The results from this study do not permit definitive
statements on this question. Nevertheless, the three KR-20
estimates were consistently higher than the 3 parallel forms
reliabilities. Cleary & Linn's (1969) Monte Carlo study
indicated that KR-20 provided better parameter estimation
than parallel-forms reliability estimates, so one must question
whether the higher ρ estimates are not the result of the
dependency between items. Perhaps the only way this
question can be validly investigated is through a Monte
Carlo study of adaptive testing with ρ known and the two
methods compared for estimating ρ.
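
For reference, the conventional KR-20 estimate discussed above can be computed as sketched below. The sketch assumes a complete persons-by-items matrix of 0/1 scores and divides by N rather than N - 1 when computing the total-score variance; both are assumptions of the illustration, since the computational details used in the study are not given here. The data are hypothetical.

```python
def kr20(scores):
    """Kuder-Richardson formula 20 for a persons x items matrix of 0/1 scores."""
    n_people = len(scores)
    n_items = len(scores[0])

    # Variance of the total scores (dividing by N).
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    # Sum over items of p * q, where p is the proportion passing the item.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / n_people
        sum_pq += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var_total)


if __name__ == "__main__":
    # Hypothetical data: five examinees, four items of increasing difficulty.
    data = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(round(kr20(data), 3))
```

Because the formula requires every examinee to have a score on every item, it does not apply directly to records in which different examinees answer different items.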

Green (1970) stated that the computer has only begun
to enter the testing business, and that as experience with
computer-controlled testing grows, important changes in
the technology of testing will occur. He predicted that
"most of the changes lie in the future ... in the inevitable
computer conquest of testing."(1)

The stradaptive testing model appears to be one such
important change.



(1) Green, B. F., Jr. In Holtzman (Ed.), p. 194.
USING COMPUTERIZED TESTS TO MEASURE NEW
DIMENSIONS OF ABILITIES: AN EXPLORATORY STUDY

CHARLES H. CORY
Navy Personnel Research and Development Center



Because most of the research with computer-assisted test
administration has been concerned with tailoring item
difficulties to test takers, what appear to be important
characteristics of computerized equipment for expanding
dimensionality of measurement appear to have been largely
ignored. Since paper-and-pencil tests are limited in terms of
stimulus control and response mode, the near exclusive
reliance on them for personnel selection has imposed
restrictions on the types of abilities which can be measured.
For example, using conventional paper-and-pencil tests, it is
difficult if not impossible to present a moving stimulus,
obtain measures of tracking performance, control item
exposure time, record response latencies, or sequence items
as a function of prior responses. Computer terminals of the
type ordinarily used for programmed instruction do have
these capacities.

The battery of tests developed for the present research
has been especially designed to exploit the special
capabilities of computer terminals for pictorial display and
movement and has thus been designated the Graphic
Information Processing (GRIP) series. A major interest of
the research was in finding abilities which are important for
on-job performance which computerized tests could
measure accurately but paper-and-pencil tests could not.

As a starting point for the investigation, five traits of
"real world" significance as defined by Mecham and
McCormick (1969) were selected. They were Short Term
Memory, Perceptual Speed, Perceptual Closure, Movement
Detection, and Dealing with Concepts/Information. Empirical
data on the relative importance of these attributes for
work performance is available from Mecham and
McCormick (1969). The study was designed to provide
comparisons of computerized and paper-and-pencil tests
designed to measure these attributes and to compare the
computerized measures and the operational variables in
terms of dimensionality and validity for job performance
criteria.

The equipment used for the research consisted of the
IBM 1500 system plus a cathode ray tube (CRT) display
unit and a screen for film presentation linked on-line to an
IBM 1130 computer. Subjects responded to visual stimuli
presented on the CRT by touching a target with a light pen
or by entering a response into the typewriter keyboard.
Programming was carried out in Coursewriter.



The GRIP Tests

The GRIP battery consisted of eight computer-administered
tests, each designed to measure a major aspect
of one or more of the five job elements.

Illustrative items from each of the GRIP tests are shown
in the Appendix.

1. Memory for Objects. Frames showing line drawings
of common objects with simple one-word names were
flashed on the screen at an average exposure time of about
one-half second per object per frame. Number of objects
per frame ranged from three to nine. After the exposure
period, subjects typed in the names of all of the objects
remembered.

2. Memory for Words. The test was identical in
intention and arrangement to the Memory for Objects, but
with words substituted for the pictures. Of course the
object of this test was to compare the recall of words given
with the recall of words generated by the candidates'
recognition and labeling processes. Words were of two
lengths: 3 letters and 5 letters.

3. Visual Memory for Numbers Test. This is a digit-span
test using the same type of methodology as was used for
the two preceding tests but having digits as stimuli. About
50 percent of the digits were presented sequentially and the
other 50 percent were presented all at once, as a single
stimulus.

4. Comparing Figures. The frames of this computerized
measure of perceptual speed contain sets of squares or
circles presented as rows, vertical columns, and right and
left slant columns. Three to six stimulus pairs are shown on
the screen at a time. Each stimulus has a crossbar, oriented
either vertically or horizontally. Subjects are asked to
record as true-false answers whether or not all crossbars of
corresponding pairs in a set have the same orientations.

5. Recognizing Objects. For this computerized closure
test partially blotted-out pictures of common objects are
presented. The first presentation shows 10 percent of the
area and more area is added in random increments of 10
percent until 90 percent of the picture is exposed. Subjects
enter the names of the stimuli on the keyboard.
6. Memory for Patterns. A test designed to measure
movement detection abilities, in which patterns are formed
by sequentially blinking dots. Subjects are asked to report
whether or not two consecutive patterns are identical and
for other items they are asked to reproduce given patterns
on the CRT with a light pen.

7. Twelve Questions. A test which resembles the
Twenty Questions game in that subjects are asked to guess
the name of an object based on yes-no answers supplied by
the computer to questions. It differs from Twenty
Questions in that the questions are supplied in the test
rather than being posed by the subject. The subject's
objectives are to select those questions which provide the
quickest identification of the object and to avoid questions
which are redundant or useless. Scores are sums of correct
responses weighted by number and characteristics of
clues received.

8. Password. A test which resembles the regular
"password" game in that sets of words are shown on the
CRT which suggest a target word. Five separate words are
shown as clues. After the first two clues and each
succeeding one, the name of the object may be typed on
the keyboard. Scores are sums of correct responses
weighted by number of clues received.

9. Latency and Accuracy Variables. In addition to
direct measures of the personal attributes, latency measures
were computed for speed of response for the Memory for
Words and the Comparing Figures tests and latency of
Recognizing Objects responses (speed of closure). In
addition a measure of the total extent to which the
response patterns failed to duplicate the stimuli in Memory
for Patterns, free response was created (PAT-ERR).
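
The Password scoring rule described under test 8 (sums of correct responses weighted by number of clues received) can be sketched as follows. The particular weights are hypothetical, chosen only so that an identification made on fewer clues earns more credit; the weights actually used in the GRIP battery are not stated in this paper.

```python
# Hypothetical weights: credit for a correct answer given after k clues.
# An answer may first be entered after two clues; five clues is the maximum.
HYPOTHETICAL_WEIGHTS = {2: 4, 3: 3, 4: 2, 5: 1}


def password_score(items, weights=HYPOTHETICAL_WEIGHTS):
    """Sum the clue-based weights over correctly answered items.

    `items` is a list of (clues_received, correct) pairs, one per target word.
    """
    total = 0
    for clues_received, correct in items:
        if correct:
            total += weights[clues_received]
    return total


if __name__ == "__main__":
    # Three target words: solved after 2 clues, solved after 5, never solved.
    print(password_score([(2, True), (5, True), (3, False)]))
```

Any monotone decreasing set of weights would express the same idea; the choice affects only the scale of the resulting score.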

Paper-and-Pencil Experimental Tests, Biographical Variables,
and Operational Tests

Together with the GRIP battery, eight paper-and-pencil
tests largely drawn from the ETS Kit of Reference Tests of
Cognitive Factors (French et al., 1963), and a motion
picture test (Drift Direction by Gibson, 1947) composed
the set of experimental tests. In addition, data for each man
were obtained for two biographical variables and for the
nine tests which are routinely administered and used for
Navy personnel decisions.



Samples

The experimental battery was administered to students
at the Navy Training Center, San Diego, during May and
June of 1972. Subjects were chosen from personnel in the
first two weeks of technical training for three ratings having
widely varied duties. Also tested in order to increase the
sample size were recruits in their final week of training who
were school eligible but had not yet received post-recruit
assignments.
Ten to eleven months subsequent to the testing, after
the subjects had served on jobs in the fleet for several
months, supervisory ratings covering both global and job
element aspects of on-job performance were collected by
mailout questionnaire.

The questionnaire used was an adaptation of the
Position Analysis Questionnaire, a broad-based empirically
derived instrument developed by E. J. McCormick and his
associates which has been extensively used for job
classification research (McCormick, Jeanneret, and
Mecham, 1972). The adapted questionnaire was used to
obtain ratings on global performance as well as performance
on all of the 42 job elements which were judged by a panel
of Chief Petty Officers to be relevant to the positions.

After a preliminary review of the questionnaire returns,
the 22 job elements having the largest representation in the
sample were selected for analysis. These 22 job elements
together with the sample N for each rating for each job
element are shown in Table 1. For instance, the first rating,
Electrician's Mate, involved Manual Control-Non-precision
Tools, Assembling-Disassembling, Hand-Arm Manipulation/
Coordination, etc. In contrast the Personnelman rating
required Using Written Materials, Compiling Data, Operating
Keyboard Devices, Persuading/Influencing Others,
etc., and the Sonar Technician rating required Using
Pictorial Materials, Using Visual Displays, Adjusting
Machines/Equipment, etc. The last group consisted of
personnel in undifferentiated ratings, largely apprenticeship
ratings. Major aspects of the assignments of this group
involved Using Spoken Verbal Communication, Manual
Control-Non-precision Tools, Attention to Details,
Completing Work, Working with Distractions, etc.

For each rating separately, zero-order validities of the
tests for supervisors' marks of the job elements were
computed and comparisons were made to identify the
predictability patterns of attributes for job elements and to
compare the operational, experimental paper-and-pencil,
and experimental computerized tests as measures of these
job elements. Similar types of statistics were computed and
comparisons carried out for the ratings of global job
performance.
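
A zero-order validity is the simple correlation between a test score and a criterion rating. The sketch below computes that correlation and shows how large it must be to reach significance in a sample as small as some of those in Table 1. The use of a two-tailed test, the function names, and the example data are assumptions of the illustration; the tabled t value of 2.101 (18 degrees of freedom, .05 level, two-tailed) is standard.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def critical_r(t_crit, n):
    """Smallest |r| that reaches the tabled critical t for a sample of n."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)


if __name__ == "__main__":
    # Hypothetical test scores and supervisor ratings for five men.
    scores = [12, 15, 9, 20, 17]
    ratings = [3, 4, 2, 5, 3]
    print(round(pearson_r(scores, ratings), 3))
    # With N = 20, a correlation must exceed roughly .44 to be significant.
    print(round(critical_r(2.101, 20), 3))
```

With cells of 20 to 30 men, only fairly large coefficients can reach significance, which bears on how the tables of significant validities should be read.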

RESULTS



Most of the statistically significant zero-order validities
of the operational variables were found for the 12 job
elements which are shown in Table 2. The predictor
variables on the left are the Armed Forces Qualification
Test; GCT, a test of vocabulary and verbal reasoning; ARI, a
test of arithmetic reasoning; MECH, a test of basic
mechanical knowledge and principles; CLER, perceptual
speed; SONR and RADIO, memory for pitches and sound
patterns; ETST, electrical knowledge and mathematics;
SHOP, Tool Knowledge; and lastly years of education.



TABLE 1

Sample Sizes for the Twenty-two Most Common Job Elements





EM


PN 


ST


UA 


Using Written Materials




4S 


30 . 
Using Pictorial Materials


20 




32 


66 


Using Visual Displays






35 


Using Spoken Verbal Communication


20 


52 


36 


92 


Using Nonverbal Sounds






31 






20 








Compiling Data




49 




SO 


Manual Control-Non-precision Tools


27 






Manual Control-Precision Tools


23 








Operating Keyboard Devices
Adjusting Machines/Equipment


23 




29 




Assembling-Disassembling


27 








Hand-Arm Manipulation/Coordination


22 








Hand-Ear Coordination






31 




Persuading/Influencing Others




40 




69 


Exchanging Routine Information




51 




Unusually Good Precision






29 


69 


Attention to Details, Completing Work


25 


51 


36 


102 


Vigilance-Continually Changing Details


20 






78 


Coping with Time Pressure


22 


49 




Working with Distractions
Keeping up to Date




48 


30 


84 




52 


86 



TABLE 2

Significant Zero-Order Validities of the Operational Variables
for Twelve Common Job Elements



frsllctor 
UtU\U Utiz$. 


5eh £I«««st 






VsrtAt 


Tools 


AiJ«ftli4 




U?or- 


&>od 
7r«elalo3 


AtttdSioa 
to 


tfsrUct «ltb 




ATQI 


St 




1 








1 1 - i - 










CC7 


n 

VA 








50* 
22* 


pn 






2;« 












51 








24« 


_ 








27* 


23* 






KlCI 


w 


i 






i 


- 1 










3«* 






oxa 
30**




2S* 




» 

CA 










m 














37* 

-26* 




M 

ST 
CA 








22« 












37« 




39* ' 


ETST 


CA 
21*








24* 
AS*






42* 













VA 








St* 


22* 








2A*« 







C«lt K« 
« 1


1 15 ] 


21 


IB 








It 








n 




^ 1 


29 i 27 






20 


se 




2> 


27 


31 




ST 


29 


30 


33 1 34 




27 






27 


34 




2S 




» 


if 




6$ 90 


19 






*7 


67 


too 




e4 



Decimal points were omitted from validity coefficients.

Coefficients significant at p < .05 and p < .01 have been identified by single and double asterisks, respectively.
A blank cell indicates nonsignificant validity.

A double hyphen indicates missing data.
Only the statistically significant coefficients are shown.
The level of significance is indicated by a single underline
for the five percent level and double underlines for the one
percent level. Blank cells indicate non-significant validities
and double dashes indicate that the Ns were too small for
validity coefficients to be computed. Rows for individual
ratings which did not have any statistically significant
validities have been omitted.

Operational variables were generally not effective for
predicting performance on job elements in the technical
ratings, and where effective did not seem to be associated
with underlying relationships or constructs. For instance,
the writing abilities of STs do not appear to be logically
related to scores on ARI and RADIO, but they were
significantly correlated with them. Similarly, the reasons
for the significant relationships between RADIO and
Pictorial Materials, SHOP and Verbal Communication
abilities, ARI and Communicating Routine Information,
MECH and Influencing Others, and CLER with writing and
verbal communication skills were not clear. Yet all of these
relationships were found.

On the other hand, interpretation of the significant
predictor-job element validities is much more logical and
consistent for the experimental tests (Table 3).



TABLE 3

Significant Zero-Order Validities of the Experimental Variables
for Twelve Common Job Elements



7r»<ictor 


Jitota rials 


flct«rl«l 
MatarUla 


TXa««l 


r ■■111- 


1 TC«1« 




r=fi»- 

MclaiC 
Ottera 


la«tU« 
MCtoa 






1 

wrklj^ vltk 


1 

1 






— 




— 










— 


— 




— 






IS 












— 
















—222 








- 














tkm» let 
JIM. CO 


1 7JI 
S? 






— 


1 — 








- 






33* 


Mm« It 


7% 

VI 








12** J — 










-42* 








tk 

i ST 




— 










— 










rr . 


ru«- 


Vk 

n 








20* 
















26* 


C«ac. 


m 










-40* 










-42* 









\ tH 

n 

1 ST 










-3fc* 














i: 




I'ST 


























Mm. fox 


F 

! w 
























































n 


r» 


























n 

5T 




























n 

ST 












44* 








30» 








ZH 


























CIO-UT 
ST


45 

71 


3X 


34 


20 
4t 
36 
92 


Ca] 

27 
•0 


LI Ma 

25 1 

2f ' 1 


37 


47 




25 
47 
3« 

102 


44 
84 


4» 

30 
W 



Decimal points were omitted from the validity coefficients.

Coefficients significant at p < .05 and p < .01 have been identified by single and double asterisks, respectively.
A blank indicates nonsignificant validity.
A double hyphen (--) indicates missing data.
The first five tests are short term memory tests with the
first test being the ETS Kit test of Associative Memory, the
next three being computerized memory tests and the last an
auditorily administered measure of digit span. Interestingly,
the memory tests show consistent negative correlations
with job elements for Electrician's Mate and the Apprenticeship
group and positive correlations for Sonar Technician
and Personnelman. The correlations for PNs are for
Writing and Verbal Communication Skills, two job elements
for which it would be logical to expect positive
correlations.

The next two tests, Counting Numbers and Comparing
Figures, are respectively paper-and-pencil and computerized
tests of perceptual speed. Both tests discriminate primarily
for Personnelmen and the Apprenticeship ratings and the
patterns of validities of the two tests were very similar.

The next three tests, together with CLO-LAT, measure
perceptual closure. Gestalt Completion and Hidden Patterns
were from the ETS battery, and Recognizing Objects and
CLO-LAT were computerized measures. The tests have
negative validities for Electrician's Mate and positive
validities for Sonar Technician, with primarily visual types of
elements being predicted for the latter rating.

The next test consisted of separate parts of the computerized
test designed to measure movement detection. It had
significant validities for Sonar Technician and also had
significant validities for Personnelmen and the Apprenticeship
rating group.

Nonsense Syllogisms and Inference are measures of syllogistic
reasoning from the ETS battery, and the next two
tests, 12 Questions and Password, are computerized variables
hypothesized to measure the same type of ability. For
Personnelmen both Inference and 12 Questions were
significantly related to job performance and the patterns of
significant validities were very similar.

The four special variables at the bottom of Table 3
correlated with visual skills and with job elements involving
accuracy and precision.

These relationships are summarized in Table 4 which
shows the number of significant validities of the operational,
experimental paper-and-pencil, and experimental
computerized variables for the job elements in each rating
in which they were present.

Major areas in which the computerized measures were
useful predictors were Adjusting Equipment for Electrician's
Mates, Writing and Working with Distractions for
Personnelmen, and Visual Displays for Sonar Technicians.
In addition computerized measures were useful supplemental
predictors of communication and interpersonal
relationships skills for Personnelmen. Thus, the computerized
tests predicted job elements which would be expected
to be central to global performance for the Personnelman
and Sonar Technician ratings.



TABLE 4

Significant Zero-Order Validities of Operational and Experimental
Variables for Twelve Common Job Elements
TABLE 5

Zero-Order Validities of Experimental Variables for Global Performance



Validity



EM PN ST UA



Short Term Memory



Object Number
Memory for Objects
Memory for Words
Memory for Numbers (V)
Memory for Numbers (A)



Counting Numbers

Comparing Figures, Machine-paced

Comparing Figures, Self-paced



Gestalt Completion
Concealed Words
Hidden Patterns
Recognizing Objects



Drift Direction

Memory for Patterns, True-false
Memory for Patterns, Free Response



Nonsense Syllogisms
Inference
Twelve Questions



WORD-LAT
CLO-LAT
FIG-LAT
PAT-ERR



--26 
-.16 

-AS 
-.15 



m 

.06 



-.28 

-.37^ 

-.04 

-.11 



-.29 
.15 
.19 



.13 
-i)3 
.20 
.20 
.17 



Perceptual Speed



Closure 



-.10 

Xi7 



-.26' 
--14 
.23 
-.06 



Movement Detection



.07 
-.07 
.21 



Dealing with Concepts/Information



-.30 
.18 

-.20 
-08 



-.24 
.05 
%04 
-.24 



.01 
.19 

.13 



Special Variables



-.06 

m 

.00 
-.17 



-.03 
-J05 
.13 
.38* 

J22 



sn 

.21 



.28 

33 

as 



m 

03. 



30 
M 
J2l 
.33 



-.05 
-.24 
.02 

-J26 



-.01 

-J07 
jOI 

-JOl 
JOS 



j06 
-j06 
j08 



.06 
--10 

.11 
-J05 



j06 
J07 
.19 



-.06 
.13 
.11 
.04 



-.11 
-.11 

M 
-.13 



*Significant at p < .05.



Zero-order validities of the experimental variables for
the global ratings of job performance are shown in Table 5.
Nine of the 92 validity coefficients (10 percent) were
statistically significant. Of the nine, five were for computerized
tests. Most of the significant validities were for Sonar
Technicians. In comparison, five of 35 validities of the
operational tests were statistically significant (Table 6), of
which three were for the UA group.



Thus, variables in the operational battery were best for
predicting global performance in apprenticeship ratings,
whereas those in the experimental battery were more useful
for predicting performance in technical ratings, and were
particularly good for predicting the performance of Sonar
Technicians. Personal attributes having the highest numbers
of significant validities were Movement Detection and
Dealing with Concepts/Information.
TABLE 6



Zero-Order Validities of Operational
Variables for Global Performance


Predictor     Rating Group
-.09
CLER


-21 


-.15 


•11 


.19 


SONR 


-.08 


•15 


--08 


-i)3 


RADO 


-.06 


.11 


.15 


.15,, 


ETST 


.16 


31 


-.09 


33 


SHOP 


.20 


.33' 


-.21 


.17 


YRBI 


-.12 


.06 


i)l 


-.11, 


YRED 


.11- 


i)5 


-i)2 • 


32 



Complete data were not available for some of the tests.
*Significant at p < .05.
**Significant at p < .01.

TABLE 7



Optimal Predictive Composites for Global Performance of Electrician's Mates
Test Scores












•Complete Set of 
Experimental and 
Operational 
Variables 


37 
.49 
.58 
.65 
.71 
38 


.00 
.20 

.28 

34 
.40 
-53 


Concealed Word 
CLER 

Drift Direction 
PAT-ERR 
Memory for Words 
YrBi , 


• -.40 
39 
-.28 
-30 
-v40 
-36 


27 



Multiple regression statistics for optimal sets of the
operational and experimental variables for Electrician's
Mate are shown in Table 7.

The first super row shows statistics for the optimal
predictive composite for the eleven operational scores and
the same type of statistics for the complete battery of
operational and experimental variables are shown in the
second super row. The second column contains the shrunken
validity coefficient for each predictor selection step.
Addition of the experimental tests to the battery increased the
expected cross validity substantially although the sample
size is so small that these figures should be interpreted with
caution. The negative beta weights for PAT-ERR and YrBi
are artifacts of the direction of scaling for those variables.



The same type of finding was characteristic of the
predictive composite for Personnelman (Table 8). Again the
negative validity of WORD-LAT was an artifact of direction
of scaling.

For Sonar Technicians (Table 9) inclusion of the
experimental tests in the battery added 38 points to the
shrunken multiple correlation. All of the variables selected
for the complete set were measures of perceptual types of
abilities.

On the other hand, the experimental variables added
almost no increment to the expected cross validation for
the Apprenticeship group (Table 10).

The usefulness of this type of expansion of coverage of
the battery may be illustrated by reference to the abilities
TABLE 8

Optimal Predictive Composites for Global Performance of Personnelmen

                              R
Predictor Set       Weight         Expected    Predictor                Beta Weight in    N
                    Determination  Cross                                   Final Composite
                                   Validation

Operational         .38            .12         SHOP                     .38               30
Classification
Test Scores

Complete Set of     .38            .12         SHOP                     .22
Experimental and    .47            .20         Gestalt Completion       -1.19
Operational         .64            .46         GCT                      1.40
Variables           .71            .52         FIG-LAT                  .69
                    .80            .65         WORD-LAT                 -.40
                                   .74         Mem. for Patterns, t-f   .37










TABLE 9

Optimal Predictive Composites for Global Performance of Sonar Technicians

                              R
Predictor Set       Weight         Expected    Predictor                Beta Weight in    N
                    Determination  Cross                                   Final Composite
                                   Validation

Operational         .38            .22         ARI                      .38               37
Classification
Test Scores

Complete Set of     .42            .28         Counting Nos.            .33
Experimental and    .54            .40         Mem. for Patterns, t-f   .32
Operational         .61            .46         Nonsense Syls.           .29
Variables           .66            .50         Recog. Obj.              .33
                    .73            .58         Gestalt Completion       .32










TABLE 10

Optimal Predictive Composites for Global Performance of the Apprenticeship Group

                              R
Predictor Set       Weight         Expected    Predictor                Beta Weight in    N
                    Determination  Cross                                   Final Composite
                                   Validation

Operational         .33            .28         ETST                     .33               111
Classification
Test Scores

Complete Set of     .33            .28         ETST                     .33
Experimental and    .37            .29         CLER                     .21
Operational         .41            .32         Concealed Word           -.19
Variables
which are being measured by the elements in each of the
four predictor composites selected. Thus, for EM to the
Perceptual Speed measure in the operational battery were
added Closure, Movement Detection, Memory, and
Accuracy of Spatial Perception from the experimental
battery. For Personnelman, to the Technical Knowledge
component, which provided the primary predictiveness in
the operational battery, were added measures of Closure,
Speed of Response and Memory from the experimental
battery. For Sonar Technician, to the general mental ability
component in the operational battery were added measures
for the Movement Detection and Closure components from
the experimental battery. And for the UA group to the
measures of Technical Knowledge and Perceptual Speed
from the operational battery was added a measure of
Closure from the experimental battery. With the exception
of the Closure measures, some of which were paper and
pencil, most distinctive predictive validities from the
experimental battery were supplied by computer-administered
tests.



DISCUSSION AND CONCLUSIONS



It is clear that the experimental battery represents an
increase in the breadth of abilities covered beyond those in
the operational Navy battery, a considerable amount of
which is attributable to the GRIP tests. Computer tests
apparently provided measures of several attributes which
were different from those measured by paper-and-pencil
tests. Furthermore, the measurement expansions of the
experimental battery served to supplement the measures of
the operational battery to produce substantial increases in
global validities.

The unique measurement characteristics of the GRIP
tests appear to be as follows:

1. Computer adxiunistranon of tests of short term recall 
tising a variety of stimuli^is feasible, and appears to offer 
advantages in ease of data collection arid processing over 
paper-and-pencil tests measuring the same attributes. Fur- 
thermore, use of computerized tests to eliminate the 
expensive and time consuming hand scoring required by 
paper-and-pencil tests of short term memory would make it 
feasible to routinely measure these skills during personnel 
classification testing. Cumpulcrizcd measures of this attri 
bute were found to have significant positive validities fui 
several job elements, particularly for those dealing with 
communication. It is probable that use of the tests for 
other occupations would identif> additional relationships 
which are useful for personnel classification. 

2. Computerized administration of perceptual speed, as
carried out in the GRIP battery, was only marginally
different from paper-and-pencil measures of perceptual
speed. Since these measures did not offer any substantial
improvements in validities over paper-and-pencil measures,
the initial judgment on their usefulness would be negative.

3. Further research will be required to clarify the
relationships between computerized and paper-and-pencil
measures of Closure. Hidden Patterns, the best of the
paper-and-pencil tests, had significant validities for
Electrician's Mates, Personnelmen, and Sonar Technicians. The
pattern of validities of Hidden Patterns for Sonar
Technicians was duplicated by CLOLAT, a measure which can be
administered and scored automatically.

4. The two experimental tests designed to measure
Movement Detection were not closely related to one
another and therefore did not provide evidence of a
Movement Detection factor. Instead these tests loaded on
memory factors, Perceptual Speed, and Perceptual Closure.
On the other hand, one of the measures, Memory for Patterns,
proved to be very useful, particularly as a predictor for both
specific and generalized performance of Sonar Technicians.
For the Electrician's Mate and Personnelman ratings it
proved to be useful at a somewhat lower level.

5. Facility in Sequential Reasoning was apparently an
ability which was uniquely measurable by computer-administered
tests. These tests demonstrated widespread
and generalized validity for Personnelman and incremented
the predictability of communication and interpersonal
relations skills over that available from paper-and-pencil
tests.

It is believed that the initial results with this technique
are promising and that further development along these
lines is warranted, particularly for jobs which require
attention to scopes. Consequently, research to be carried
out during Fiscal Year 1976 will be concerned with refining
measures of Movement Detection, Sequential Reasoning,
Perceptual Closure, response latencies, and accuracy of
spatial perception, together with the construction of tests
for other abilities which appear to be potentially useful for
personnel selection. Also, we hope to convert one or more
of the tests to a branching mode designed to tailor item
difficulties to candidates.

APPENDIX

ILLUSTRATIVE ITEMS FROM THE [illegible] COMPUTERIZED TESTS




2. MEMORY FOR WORDS

&!N MAN OWL

PRIZE IVORY TABLE
STOVE MUSIC SOLID

FJR TEA HAT KID ART
EYE CAT RIB BAT

3. VISUAL MEMORY FOR NUMBERS TEST

2 5 1 6
124956387



COMPARING FIGURES

[Illustrative figure items; graphics not recoverable from the source.]

6. MEMORY FOR PATTERNS

[Illustrative pattern grids; graphics not recoverable from the source.]

7. COMPUTERIZED 12 QUESTIONS

Mineral
Frequently larger than a glove

1. Is it often used as clothing?
2. Is it made of a soft material?
3. Is it often used at meals?
4. Do people often wear it?
5. Does it have moving parts?
6. Does it have a hard surface?
7. Is it always found on an auto?
8. Is it made at least partly of glass?
9. Does it have more than one use?
10. Does it use electricity?

11. Is it sometimes used by magicians?
12. Do men and women use it equally often?
13. Is it often used before a person goes out?
14. Can one use it with his eyes closed?
15. Must one touch it to use it?

16. Does it appear dark in the light?
17. Can it be used to send messages?
18. Can it improve one's appearance?

(Mirror)



COMPUTERIZED PASSWORD

Metal
Finger

Soaring
Emblem

Circle
Feathers

Shiny
Large

Wedding (Ring)
Bald (Eagle)

A BROAD-RANGE TAILORED TEST OF VERBAL ABILITY



FREDERIC M. LORD
Educational Testing Service



This report describes briefly a broad-range tailored test
of verbal ability, appropriate at any level from fifth grade
upwards, through graduate school. The test score places
everyone at all levels directly on the same score scale.

In a tailored test, the items administered to an individual
are chosen for their effectiveness for measuring him. Items
administered later in the test are selected by computer,
according to some rule based on the individual's
performance on the items administered to him earlier.
Improved measurement is obtained 1) by matching item
difficulty to the ability level of the individual and 2) by
using the more discriminating items in the available item
pool. The matching of test difficulty to the individual's
ability level is advantageous and desirable for psychological
reasons. For references on tailored testing, see Wood
(1973). Also Cliff (1975), Jensema (1974a, 1974b),
Killcross (1974), Mussio (1973), Spineti and Hambleton
(1975), Urry (1974a, 1974b), Waters (1974), Betz and
Weiss (1974), DeWitt and Weiss (1974), Larkin and Weiss
(1974), McBride and Weiss (1974), Weiss (1973, 1974),
Weiss and Betz (1973).

The broad-range test consists of 182 verbal items. These
were chosen from all levels of Cooperative Tests' SCAT and
STEP, from the College Entrance Examination Board's
Preliminary Scholastic Aptitude Test, and from the
Graduate Record Examination. The choice was made solely
on the basis of item type and difficulty level. There was no
attempt to secure the best items by selecting on item
discriminating power.

Two parallel forms of this 182-item tailored test were
constructed. Only one of these forms is considered here.

Ideally there should be only one item type in each row,
so that all examinees would take the same number of items
of each type. The arrangement of Table 1 is an attempt to
approximate this ideal using the items available. (Few if any
hard items of types a and e were in the total pool, also few
if any easy items of types b and c. Types a and b, also types
c and e, seem fairly similar.)



TABLE 1

Broad-Range Verbal Test Items Arranged by Difficulty Level and Serial Number.
(a, b, c, d, e represent different verbal item types.)

[Table body not recoverable from the source. Rows: item serial numbers 1 to 25. Columns: item difficulty level, from easy to hard, labeled by grade level. Each cell holds an item-type letter.]

The 182 items in a single form of the test are
represented in Table 1, where they are arranged in columns
by difficulty level. An individual answers just one item in
each row of the table, a total of just 25 items. There are
five verbal item types, denoted by a, b, c, d, e. Within each
item type, the items in each column are arranged in order
of discriminating power with the best items at the top.

The examinee starts with an item in the first row. The
difficulty level of this item is determined by the examinee's
grade level, or some other rough estimate of his ability. If
he answers the first item correctly, he next takes an item in
the second row that is harder than (to the right of) the first
item. If he answers the first item incorrectly, he next takes
an item in the second row that is easier than (to the left of)
the first item.

He may continue with the third and subsequent rows,
moving to the right after each correct answer, or to the left
after each incorrect answer, until he has at least one right
answer and at least one wrong answer. At this point, the
computer uses item characteristic curve theory to compute
the maximum likelihood estimate of the examinee's ability
level. In effect, the computer asks, for what ability level is
the likelihood of the observed pattern of responses at a
maximum, taking into account the difficulty and other
characteristics of the items administered up to this point?
The ability level that maximizes this likelihood is the
current estimate of the examinee's ability.
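
The maximum likelihood step described above can be illustrated with a minimal sketch. It assumes a three-parameter logistic item characteristic curve with the conventional 1.7 scaling constant; the report does not state the exact curve or search method used, and the item parameters, the grid search, and the function names below are hypothetical.

```python
import math

def icc(theta, a, b, c):
    """Assumed three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log likelihood of a right/wrong (1/0) response pattern at ability theta."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = icc(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def ml_estimate(items, responses, lo=-4.0, hi=4.0, step=0.01):
    """Grid search for the ability level that maximizes the likelihood."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Hypothetical items (a, b, c) and a pattern with both right and wrong answers.
items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2), (1.0, 2.0, 0.2)]
responses = [1, 1, 0, 0]
theta_hat = ml_estimate(items, responses)
print(round(theta_hat, 2))
```

A grid search is used here only because it is easy to inspect; any one-dimensional maximizer would serve the same purpose.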

From this point on, the next item to be administered
will be of the same item type as the item in the next row
that best matches in difficulty the examinee's estimated
ability level. Given this item type, we survey all items of
this type and administer next the item that gives the most
information at his estimated ability level.
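
As a rough illustration of choosing the item that gives the most information at the current ability estimate, the sketch below again assumes a three-parameter logistic curve and uses the standard item information formula for that model. The pool, its parameters, and the function names are hypothetical, and the pool is assumed to have been filtered already to the required item type.

```python
import math

def icc(theta, a, b, c):
    """Assumed three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information at theta for the three-parameter logistic model."""
    p = icc(theta, a, b, c)
    return (1.7 * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def pick_next_item(theta_hat, pool, used):
    """Index of the unused item with the most information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))

# Hypothetical pool of (a, b, c) triples, all of one item type; item 0 is already used.
pool = [(0.8, -1.0, 0.2), (1.2, 0.1, 0.2), (1.2, 1.5, 0.2), (0.6, 0.0, 0.2)]
chosen = pick_next_item(0.0, pool, used={0})
print(chosen)
```

The example favors a discriminating item whose difficulty lies near the estimate over a less discriminating item of exactly matching difficulty, which is the trade-off the selection rule is meant to capture.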

After each new response by the examinee, his ability is
reestimated. The item type of the next item is determined,
as above, and the best item (not already used) of that type
is chosen and administered. This continues until he has
answered 25 items, one for each row of the table. The
maximum likelihood estimate of his ability determined
from his responses to all 25 items is his final verbal ability
score. According to the item characteristic curve model, all
such scores, for various examinees, are automatically on the
same ability scale, regardless of which set of items was
administered.

About thirty different designs for a broad-range tailored
test of verbal ability were tried out on the computer,
administering each one to a thousand or so simulated
examinees. The final design was recently chosen and has
not yet been implemented on the computer for
administration to real flesh-and-blood examinees.

Consider first the effect of the difficulty level of the first
item administered. The vertical dimension in Figure 1
represents the standard error of measurement of obtained
test score on the broad-range tailored test, computed by a
Monte Carlo study. Each symbol shows how the standard
error of measurement varies with ability level (horizontal
axis). The four symbols represent the results obtained with
four different starting points. The points marked * were
obtained when the difficulty level of the first item
administered was near -1.0 on the horizontal scale, about
fifth grade level. The small dots represent the results when
the difficulty level of the first item was near 0, about
ninth grade level. For the hexagons, it was near 0.75, near
the average verbal ability level of college applicants taking
the College Entrance Examination Board's Scholastic
Aptitude Test. For the points marked by an x, it was near 1.5.
For any given ability level, the standard error of
measurement varies surprisingly little, considering the
extreme variation in starting item difficulty.

Figure 1. The standard error of measurement at 13 different
ability levels for four different starting points for the 25-item
broad-range tailored test.

Various designs were also tried out with more columns
or with fewer than the 10 columns shown in Table 1. A test
with 20 columns, spanning roughly the same difficulty
range as Table 1 but requiring 363 items, was found to be
at least twice as good as the 10-column 182-item test of
Table 1. The reason for this is not that the columns in
Table 1 are too far apart, but mainly that selecting the best
items (best for a particular individual) from a 363-item pool
will give a much better 25-item test than selecting the same
number of items from a smaller, 182-item pool. Still better
tests could be produced by using still larger item pools,
even though only 25 items are administered to each
examinee.

It is important to compare the broad-range tailored test
with a conventional test. Let us compare our broad-range
tailored verbal test with the Preliminary Scholastic
Aptitude Test of the College Entrance Examination Board.
Figure 2 shows the information function for the Verbal
score on each of three forms of the PSAT adjusted to a test
length of just 25 items. It also shows the information
function for the Verbal score on the broad-range tailored test,
which administers just 25 items to each examinee. The tailored
test shown in Figure 2 corresponds to the hexagons of
Figure 1, since they represent the results obtained when the
first item administered is at a difficulty level appropriate
for average college applicants. The PSAT information
functions are computed from estimated item parameters.
For points spaced along the ability scale, the tailored test
information function is estimated from the test responses
of simulated examinees.

Figure 2. Information function for the 25-item tailored test, also
for three forms of the Preliminary Scholastic Aptitude Test (dotted
lines) adjusted to a test length of 25 items.

It is encouraging but not surprising to find that the
tailored test is at least twice as good as a 25-item
conventional PSAT at almost all ability levels. After all, at
the same time that we are tailoring the test to fit the
individual, we are taking advantage of the large item pool,
using the best 25 items available within certain restrictions
already mentioned concerning item type. It would, of
course, be desirable to confirm this evaluation by extensive
test administrations, using flesh-and-blood examinees
instead of simulated examinees.

In conclusion, the writer would like to make an offer
that should enable research workers and graduate students
conveniently to design and build actual tailored tests and
administer them to real examinees. On written request from
suitably qualified individuals, he will provide estimated
item parameters for the verbal items in any or all of the
following Cooperative Tests:

SCAT II, Forms 1A, 2A, 2B, 3A, 3B, 4A (50 items
each);

STEP II, Reading Test, Part 1 only, Forms 2A, 2B, 3A,
3B, 4A (30 items each);

SCAT I, Forms 2A, 2B, 3A, 3B (60 items each).

This represents a pool of 690 calibrated verbal items
available for research or other purposes. (This offer expires
when better methods for estimating item parameters have
been developed; very soon, it is to be hoped.)

SOME LIKELIHOOD FUNCTIONS FOUND IN TAILORED
TESTING

FREDERIC M. LORD
Educational Testing Service



This brief note discusses some peculiar likelihood
functions encountered while administering the Broad-Range
Tailored Test of Verbal Ability to simulated examinees.
Other workers have doubtless encountered similar
problems.

Samejima (1973) shows that when the item parameters
are known, there may be no finite ability level $\theta$ that
maximizes the likelihood function. Also, that the likelihood
function may have more than one (local) maximum.

Barnett (1966) states "Given a single sample of
observations ... [r]egularity conditions ... are no
guarantee that a single root of the likelihood equation will
exist for this sample. In fact, there will often exist multiple
roots, corresponding to multiple relative maxima of the
likelihood function, even if the regularity conditions are
satisfied."

Huzurbazar (see Kendall & Stuart, 1973, sections
18.11-18.12) showed under regularity conditions that
ultimately, as the number of observations becomes large,
there is a unique consistent maximum likelihood estimator.
His regularity conditions would apply if the test were
composed of items with identical ICC. His conditions
would be violated otherwise, but it should be possible to
extend his proof to cover a reasonable set of regularity
conditions for the present problem.

To have a large number of observations, we would need
to administer a large number of test items. When the
number of items is not large, and especially if the test is too
hard for some individuals, we may expect $\hat{\theta} = -\infty$
occasionally. An examinee who makes unlucky guesses and
scores below the chance level is, not unreasonably, likely to
get an estimated ability of $\hat{\theta} = -\infty$. Such an estimate would
presumably be corrected if a sufficiently large number of
additional test items were administered to him.

In the study on a Broad-Range Tailored Test of Verbal
Ability, many tens of thousands of simulated examinees
took various simulated tailored tests. Items with known
ICC were administered one at a time to each individual
examinee. After each item was administered, an
approximation to the maximum likelihood estimate $\hat{\theta}$ of his ability
was computed, based on all his responses up to that point.

When the examinee has wrong answers but no right
answers, $\hat{\theta} = -\infty$. When he has right answers but no wrong
answers, $\hat{\theta} = +\infty$. When he has both right and wrong
answers, there is usually no difficulty finding a finite $\hat{\theta}$.
An occasional difficulty resolves itself as more items are
administered. It is very rare to have any problem after the
first ten or fifteen items, since by then the item difficulty is
usually tolerably well tailored to the examinee's ability.
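
The two boundary cases can be checked numerically. The sketch below assumes a three-parameter logistic item characteristic curve (the note does not restate the curve used) and three hypothetical hard items. With only wrong answers the likelihood keeps rising as ability falls, and with only right answers it keeps rising as ability grows, so no finite maximum exists in either case.

```python
import math

def icc(theta, a, b, c):
    """Assumed three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def likelihood(theta, items, responses):
    """Likelihood of a right/wrong (1/0) response pattern at ability theta."""
    value = 1.0
    for (a, b, c), u in zip(items, responses):
        p = icc(theta, a, b, c)
        value *= p if u == 1 else 1.0 - p
    return value

# Hypothetical hard items (a, b, c).
hard_items = [(1.0, 1.4, 0.2), (1.0, 1.6, 0.2), (1.0, 2.0, 0.2)]
thetas = [-6.0, -4.0, -2.0, 0.0, 2.0]

all_wrong = [likelihood(t, hard_items, [0, 0, 0]) for t in thetas]
all_right = [likelihood(t, hard_items, [1, 1, 1]) for t in thetas]
print(all_wrong)  # falls steadily as theta increases
print(all_right)  # rises steadily as theta increases
```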

The present study investigates the case of simulated
examinee T94 for whom there were unusual difficulties in
obtaining a finite $\hat{\theta}$. Table 1 describes the first 23 items
administered to him, shows his response to each item
(1 = right, 0 = wrong), and gives $\hat{\theta}$, the maximum likelihood
estimate of his ability based on his responses to items
already administered.

Examinee T94 is really a very low ability examinee: his
true $\theta$ is actually -2.9. Furthermore, the first items
administered to him were very difficult items ($b_i > 1.35$)
which he would have no chance at all of answering
correctly except by guessing. By lucky guessing, he
nevertheless got 6 items right out of the first 12.

If $c_i$ were .20 for each of these items, the chance of a
score as good or better than 6 solely by guessing is less than
.02. The maximum likelihood estimates of the examinee's
ability based on his performance on these first twelve items
range from 1.6 to 2.2, as shown in the last column of the
table.
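
The guessing probability quoted above is a binomial tail and can be verified directly. The sketch below assumes 12 independent guesses, each correct with probability .20; the function name is hypothetical. The result is roughly .019, consistent with "less than .02".

```python
from math import comb

def prob_at_least(k, n, c):
    """Probability of k or more successes in n independent guesses, each with chance c."""
    return sum(comb(n, j) * c ** j * (1.0 - c) ** (n - j) for j in range(k, n + 1))

# Six or more right out of twelve by guessing alone, with c = .20.
p = prob_at_least(6, 12, 0.20)
print(round(p, 4))
```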

His guessing on the next seven items was uniformly
unsuccessful. All items through item 17 were difficult, with
$b_i > 1.35$. His performance on these 17 difficult items
earned him an ability estimate of $\hat{\theta} = 1.2$.

Item 18 was an easier item, $b_{18} = .65$. I suggest that the
following rationalizations provide a correct explanation of
the $\hat{\theta}$ subsequently obtained.

The examinee has answered correctly 6 items with
$b_i > 1.35$ and has failed 12 items including one with
$b_i = .65$. The last failure suggests that $\theta$ is low and that
earlier correct responses were due to lucky guessing. If $\theta$ is
low, all items so far administered are too difficult for the
examinee and are of no use, even for placing a lower bound
on his ability level. When an examinee has given only wrong
responses and lucky random guesses, his estimated ability
should be $\hat{\theta} = -\infty$.

When the examinee answers item 20 ($b_{20} = -.83$)
correctly, it is now plausible to assume that his ability lies
between -.83 and .65 (.65 being the difficulty level of item
18, which he answered incorrectly). The maximum
likelihood estimate turns out to be $\hat{\theta} = -.4$.

*Not computed for n < 12.

**[Partly illegible in the source.] The listed $\hat{\theta}$ is either an approximate value determined numerically or a value read from the
log likelihood tabulated at intervals of .2 along the $\theta$ scale.



Subsequent failures on items 21 and 22 lower this
estimate first to a value that is illegible in the source and
then to -2.6. When the examinee
finally fails an item with $b_i = -2.84$, it now appears that all
earlier correct answers were due to lucky guessing and that
all items so far administered were too difficult for this
examinee. The situation is much the same as the situation
after the answer to item 18, already discussed. Again, not
unreasonably, $\hat{\theta} = -\infty$.

In this testing, only the very last item was of appropriate
difficulty for the examinee, whose true ability was
$\theta = -2.9$. All but the last two items were very much too
hard. He answered both the last two items incorrectly.
Thus, it is only to be expected that his final ability estimate
is $\hat{\theta} = -\infty$. Administration of further items of appropriate
difficulty would quickly correct this estimate.

The likelihood functions used to obtain most of the
successive $\hat{\theta}$ discussed above are shown in Figure 1. The
code numbers identifying the curves are given in Table 1. In
order to get them all on the same graph, each likelihood
function is divided by its maximum value, so that the
maxima of the normalized curves all fall on the top
boundary of the figure. These curves, together with the
discussion given above, seem to explain the anomalous
values of $\hat{\theta}$. When enough responses have been obtained to
indicate a lower limit to the examinee's ability, then finite
ability estimates will be obtained.

BAYESIAN TAILORED TESTING

AND THE INFLUENCE OF ITEM BANK CHARACTERISTICS¹



Conventional tests are generally constructed to
discriminate over a rather wide range of examinee ability. One
of the consequences of this approach is that a conventional
test usually contains many items which are not appropriate
for a particular level of ability. Psychometricians have long
been aware of this and in recent years they have
increasingly turned their attention to the possibility of
programming computers to design and administer tests.

Of the many computerized testing methods which have
been proposed, the Bayesian process developed by Owen
(1969) seems to be the most elegant and intuitively
appealing method. It assumes locally independent binarily scored
items and a normal ogive model (Lord and Novick, 1968,
Ch. 16) in which the probability of passing a free response
item $g$ at ability level $\theta$ is expressed as

$$P_g(\theta) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a_g(\theta - b_g)} \exp(-t^{2}/2)\,dt \qquad (1)$$

If the item is not of the free response type and $c_g$ is the
probability of guessing correctly, the probability of passing
becomes

$$P'_g(\theta) = c_g + (1 - c_g)\,P_g(\theta) \qquad (2)$$

The derivation of Owen's Bayesian tailoring process has
been described several times in the literature (Owen, 1969;
Urry, 1971; Jensema, 1974a). We will briefly run through
the fundamental formulas here for the sake of
completeness.

Suppose $N(\theta_0, \sigma_0^{2})$ expresses our knowledge of an
examinee having ability $\theta$. If we administer free response item
$g$ which has discrimination and difficulty parameters $a_g$ and
$b_g$, and if the examinee responds correctly, Bayes' theorem
specifies that the information available is

$$p(\theta \mid 1) = k\,P_g(\theta)\,(\sqrt{2\pi}\,\sigma_0)^{-1} \exp\!\left[-\frac{(\theta - \theta_0)^{2}}{2\sigma_0^{2}}\right] \qquad (3)$$

where $P_g(\theta)$ is defined by (1) and $k$ is such that

$$\int_{-\infty}^{\infty} p(\theta \mid 1)\,d\theta = 1 \qquad (4)$$

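¹This research was supported by the Office of Demographic
Studies, Gallaudet College, Washington, D.C. 20002.

The solution is

$$k^{-1} = \tfrac{1}{2}\,(1 - \operatorname{erf} D) \qquad (5)$$

where erf $D$ is the error function

$$\operatorname{erf} D = \frac{2}{\sqrt{\pi}} \int_{0}^{D} \exp(-t^{2})\,dt \qquad (6)$$

and

$$D = \frac{b_g - \theta_0}{\sqrt{2\,(a_g^{-2} + \sigma_0^{2})}} \qquad (7)$$

The expectation of the posterior mean is

$$E(\theta \mid 1) = \theta_0 + \frac{2\sigma_0^{2}}{\sqrt{2\pi\,(a_g^{-2} + \sigma_0^{2})}}\,\exp(-D^{2})\,(1 - \operatorname{erf} D)^{-1} \qquad (8)$$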



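and the variance is

$$\operatorname{var}(\theta \mid 1) = \sigma_0^{2}\left\{1 - \frac{2\sigma_0^{2}}{\pi\,(a_g^{-2} + \sigma_0^{2})}\,\exp(-2D^{2})\,(1 - \operatorname{erf} D)^{-2}\left[1 - \sqrt{\pi}\,D\exp(D^{2})\,(1 - \operatorname{erf} D)\right]\right\} \qquad (9)$$

Similarly, if the examinee gives a wrong response to item $g$
we have

$$p(\theta \mid 0) = 2\,(1 + \operatorname{erf} D)^{-1}\,[1 - P_g(\theta)]\,(\sqrt{2\pi}\,\sigma_0)^{-1} \exp\!\left[-\frac{(\theta - \theta_0)^{2}}{2\sigma_0^{2}}\right] \qquad (10)$$

$$E(\theta \mid 0) = \theta_0 - \frac{2\sigma_0^{2}}{\sqrt{2\pi\,(a_g^{-2} + \sigma_0^{2})}}\,\exp(-D^{2})\,(1 + \operatorname{erf} D)^{-1} \qquad (11)$$

and

$$\operatorname{var}(\theta \mid 0) = \sigma_0^{2}\left\{1 - \frac{2\sigma_0^{2}}{\pi\,(a_g^{-2} + \sigma_0^{2})}\,\exp(-2D^{2})\,(1 + \operatorname{erf} D)^{-2}\left[1 + \sqrt{\pi}\,D\exp(D^{2})\,(1 + \operatorname{erf} D)\right]\right\} \qquad (12)$$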

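To expand this discussion a little further assume that
item $g$ is not a free response item and that it has a
probability $c_g$ of guessing correctly. If the examinee gives a correct
response we have

$$p'(\theta \mid 1) \propto P'_g(\theta)\,(\sqrt{2\pi}\,\sigma_0)^{-1} \exp\!\left[-\frac{(\theta - \theta_0)^{2}}{2\sigma_0^{2}}\right] \qquad (13)$$

with the constant of proportionality chosen so that the density integrates to one,

[Equation (14), the posterior mean $E'(\theta \mid 1)$, is not legible in the source.]

and

[Equation (15), the posterior variance $\operatorname{var}'(\theta \mid 1)$, is not legible in the source.]

where the prime is used to signify the effect of guessing,
$P'_g(\theta)$ is defined by (2), and we take

[Equations (16) to (18), which define the auxiliary quantities used in (14) and (15), are not legible in the source.]

If the examinee gives a wrong response the formulas in
(10), (11), and (12) hold, since our information, that the
examinee does not know the correct answer, is the same as
in the free response case.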
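
The posterior mean and variance after a single item can be sketched in code. This is not a transcription of the paper's equations: the closed forms below are the standard results for a normal prior combined with a normal ogive item, written in terms of $d = \sqrt{2}\,D$, and they are stated here as an assumption about what the garbled formulas express. A brute-force numerical integration is included so the closed forms can be checked independently. All function names and the example parameters are hypothetical.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def owen_update(mu, var, a, b, c, correct):
    """Posterior mean and variance of ability after one normal ogive item.

    mu, var: prior mean and variance; a, b, c: discrimination, difficulty,
    guessing probability (c = 0 gives the free response case).
    """
    s = math.sqrt(1.0 / (a * a) + var)
    d = (b - mu) / s  # equals sqrt(2) * D in the notation of equation (7)
    if correct:
        prob = c + (1.0 - c) * normal_cdf(-d)  # prior probability of a correct response
        r = (1.0 - c) * normal_pdf(d) / prob
        return mu + (var / s) * r, var * (1.0 - (var / (s * s)) * r * (r - d))
    r = normal_pdf(d) / normal_cdf(d)
    return mu - (var / s) * r, var * (1.0 - (var / (s * s)) * r * (r + d))

def numeric_update(mu, var, a, b, c, correct, n=20001):
    """Independent check: integrate likelihood times prior on a fine grid."""
    sd = math.sqrt(var)
    w0 = w1 = w2 = 0.0
    for i in range(n):
        t = mu + sd * (-10.0 + 20.0 * i / (n - 1))
        p = c + (1.0 - c) * normal_cdf(a * (t - b))
        w = (p if correct else 1.0 - p) * normal_pdf((t - mu) / sd)
        w0 += w
        w1 += w * t
        w2 += w * t * t
    mean = w1 / w0
    return mean, w2 / w0 - mean * mean

# Hypothetical prior N(0, 1) and item (a, b, c) = (1.2, 0.5, 0.25).
print(owen_update(0.0, 1.0, 1.2, 0.5, 0.25, True))
print(owen_update(0.0, 1.0, 1.2, 0.5, 0.25, False))
```

Note that a wrong answer moves the estimate by the same amount whether or not guessing is possible, which matches the remark above that formulas (10) to (12) continue to hold.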

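Now assume we have $n$ items and want to select the best
one for administration. The expected posterior variance of
$\theta$ after administration of a particular item is

$$E[\operatorname{var}(\theta \mid u)] = \sum_{u=0}^{1} p(u)\,\operatorname{var}(\theta \mid u) \qquad (19)$$

when items are of the free response type and

$$E[\operatorname{var}'(\theta \mid u)] = \sum_{u=0}^{1} p'(u)\,\operatorname{var}'(\theta \mid u) \qquad (20)$$

when the items are affected by guessing. In (19) and (20) $u$
refers to the correctness of the examinee's response and is
taken as 1 or 0, and $p(u)$ and $p'(u)$ are the prior probabilities of that response. The item which leads to the smallest
expected posterior variance is the most desirable one to
administer. It is sufficient to select the item with the smallest
value $\alpha$ where

$$\alpha = (a_g^{-2} + \sigma_0^{2})\,\bigl(1 - (\operatorname{erf} D)^{2}\bigr)\exp(2D^{2}) \qquad (21)$$

for free response items and

$$\alpha = (a_g^{-2} + \sigma_0^{2})\left[\frac{1 + c_g}{1 - c_g} - \operatorname{erf} D\right](1 + \operatorname{erf} D)\exp(2D^{2}) \qquad (22)$$

when guessing is present.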
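
A minimal sketch of the selection rule follows. The free response criterion is equation (21). The form used when guessing is present is an assumption: it was derived so that the expected posterior variance equals the prior variance minus $2\sigma_0^{4}/(\pi\alpha)$ and so that it reduces to (21) when the guessing probability is zero; the printed form of (22) could not be read from the source. Function names and the pool are hypothetical.

```python
import math

def owen_alpha(mu, var, a, b, c=0.0):
    """Selection criterion: a smaller value means a smaller expected posterior variance."""
    spread = 1.0 / (a * a) + var
    d_big = (b - mu) / math.sqrt(2.0 * spread)  # D of equation (7)
    e = math.erf(d_big)
    return spread * ((1.0 + c) / (1.0 - c) - e) * (1.0 + e) * math.exp(2.0 * d_big * d_big)

def expected_posterior_variance(mu, var, a, b, c=0.0):
    """Expected posterior variance implied by the criterion (assumed relation)."""
    return var - (2.0 * var * var / math.pi) / owen_alpha(mu, var, a, b, c)

def select_item(mu, var, pool, used):
    """Index of the unused item with the smallest criterion value."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return min(candidates, key=lambda i: owen_alpha(mu, var, *pool[i]))

# Hypothetical pool of (a, b, c) triples and a N(0, 1) prior.
pool = [(1.0, -1.5, 0.2), (1.0, 0.2, 0.2), (2.0, 0.3, 0.2), (2.0, 2.5, 0.2)]
first = select_item(0.0, 1.0, pool, used=set())
second = select_item(0.0, 1.0, pool, used={first})
print(first, second)
```

Because the criterion grows with $\exp(2D^{2})$, items far from the current estimate are heavily penalized even when they are highly discriminating.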

If we have a pool of $n$ items and estimates of the normal
ogive model parameters for each item, we may use a
Bayesian sequential procedure to select items for
administration to a particular examinee. Let $\hat{\theta}_{(m)}$ and $\sigma^{2}_{(m)}$ be an
estimate of the examinee's ability and its variance, where $m$
indicates the number of items administered. Assume the
population has ability distributed as $N(0,1)$ and take $\hat{\theta}_{(0)} = 0$
and $\sigma^{2}_{(0)} = 1$. Calculate $\alpha_i$ values for all (unused)
items, $i = 1, 2, \ldots, (n - m)$, using (22). (We will assume
that the items are not free response.) The examinee is
administered the item with the smallest $\alpha$ value. If an
incorrect response is given, $\hat{\theta}_{(m+1)}$ and $\sigma^{2}_{(m+1)}$ are calculated
from (11) and (12). If the response is correct, (14) and (15)
are used. This cycle is repeated until $\sigma^{2}_{(m)}$ is within some
preselected limit. The selection of a $\sigma^{2}_{(m)}$ value for
termination is, of course, arbitrary. It is usually selected to yield
some expected level of validity according to

$$r_{\theta\hat{\theta}} = \bigl(1 - \sigma^{2}_{(m)}\bigr)^{1/2} \qquad (23)$$

The characteristics of an item bank used for tailored
testing are very important to the efficiency and accuracy of
the process. There are four basic requirements for a good
item bank. These have been mentioned in whole or part in a
number of publications (cf. Urry, 1970, 1971a, 1971b,
1974; Jensema, 1972, 1974a, 1974b; etc.) and may be
summarized as follows:

1) Item discrimination should be as high as possible and
should not be less than .8.

2) Item guessing probabilities should be as low as
possible.

3) The item bank must consist of a sufficiently large
number of items.

4) Item difficulties should have a rectangular
distribution.

The remainder of this paper will concentrate on
demonstrating the importance of each of these four requirements.

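Assume that an infinitely large item bank exists and that
all items have the same discriminatory power and the same
probability of guessing correctly. The assumption of an
infinitely large item bank allows the selection of an item $i$
having a difficulty level exactly equal to any given estimate
of ability. When this can be done many of the formulas
may be greatly simplified since we have

$$D = 0 \qquad (24)$$

and

$$\operatorname{erf} D = 0 \qquad (25)$$

The equations for $\sigma^{2}_{(m+1)}$ for correct and incorrect
responses become

$$\sigma^{2}_{(m+1)} = \sigma^{2}_{(m)}\left\{1 - \frac{2\sigma^{2}_{(m)}\,(1 - c)^{2}}{\pi\,(1 + c)^{2}\,(a^{-2} + \sigma^{2}_{(m)})}\right\} \qquad (26)$$

and

$$\sigma^{2}_{(m+1)} = \sigma^{2}_{(m)}\left\{1 - \frac{2\sigma^{2}_{(m)}}{\pi\,(a^{-2} + \sigma^{2}_{(m)})}\right\} \qquad (27)$$

where $m$ is the number of items previously administered.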

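An item's difficulty is the point at which the
probability of knowing the correct answer is exactly 1/2. If guessing
is in effect the probability of responding correctly is equal
to the probability of knowing the answer plus the
probability of guessing correctly. Then $\sigma^{2}_{(m+1)}$ may be expected
to be the sum of (26) and (27) weighted by the
probabilities of a correct or incorrect response:

$$\sigma^{2}_{(m+1)} = \frac{1 + c}{2}\,\sigma^{2}_{(m)}\left\{1 - \frac{2\sigma^{2}_{(m)}\,(1 - c)^{2}}{\pi\,(1 + c)^{2}\,(a^{-2} + \sigma^{2}_{(m)})}\right\} + \frac{1 - c}{2}\,\sigma^{2}_{(m)}\left\{1 - \frac{2\sigma^{2}_{(m)}}{\pi\,(a^{-2} + \sigma^{2}_{(m)})}\right\} \qquad (28)$$

A little algebraic manipulation reduces this to

$$\sigma^{2}_{(m+1)} = \sigma^{2}_{(m)}\left\{1 - \frac{2\sigma^{2}_{(m)}\,(1 - c)}{\pi\,(1 + c)\,(a^{-2} + \sigma^{2}_{(m)})}\right\} \qquad (29)$$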



Inserting appropriate values for $a$ and $c$ in equation
(29) and plotting the results against the number of items
administered demonstrates the influence of item
discrimination and guessing probability on the tailoring process.
Figure 1 plots the expected standard error of the estimate
$\sigma_{(m+1)}$ by the number of items administered for five
levels of discrimination when guessing probability is zero
and an infinite number of items are available. Notice the
sharp difference in the number of items needed at different
levels of discrimination. For example, if the items have
discriminatory powers of 2.5 only 4 or 5 items are needed to
reach a standard error of the estimate of .30 while 17 or 18
items are needed to reach this level when item
discrimination is only 1.0.

Now suppose we take item discrimination to be 1.0, a
rather low value which is easily obtained. Figure 2 plots the
expected standard error of the estimate for various guessing
values by the number of items administered. The guessing
values range from .5 (i.e. true-false items) to 0.0 (i.e. free
response items). The greater the probability of guessing, the
more items required to reach a specific standard error of
the estimate.
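
The expected standard error curves can be traced with a short recursion. The sketch assumes that equation (29) has the form $\sigma^{2}_{(m+1)} = \sigma^{2}_{(m)}\{1 - 2\sigma^{2}_{(m)}(1-c)/[\pi(1+c)(a^{-2}+\sigma^{2}_{(m)})]\}$, which is a reading of a garbled original and should be treated as such, and that the prior variance is 1. Function names are hypothetical. Under this reading the recursion first falls below a standard error of .30 at the fifth item for a discrimination of 2.5 and at the eighteenth item for a discrimination of 1.0, in line with the counts quoted above.

```python
import math

def next_variance(var, a, c=0.0):
    """One step of the expected-variance recursion with difficulty matched to the estimate."""
    return var * (1.0 - 2.0 * var * (1.0 - c) / (math.pi * (1.0 + c) * (1.0 / (a * a) + var)))

def expected_se(n_items, a, c=0.0, var0=1.0):
    """Expected standard error of the estimate after n_items items."""
    var = var0
    for _ in range(n_items):
        var = next_variance(var, a, c)
    return math.sqrt(var)

def items_needed(target_se, a, c=0.0, var0=1.0, limit=1000):
    """Number of items until the expected standard error is at or below target_se."""
    var, m = var0, 0
    while math.sqrt(var) > target_se and m < limit:
        var = next_variance(var, a, c)
        m += 1
    return m

def validity(se):
    """Expected correlation of true and estimated ability for a N(0, 1) population."""
    return math.sqrt(1.0 - se * se)

print(items_needed(0.30, 2.5), items_needed(0.30, 1.0))
print(round(expected_se(10, 1.0, 0.5), 3), round(expected_se(10, 1.0, 0.0), 3))
```

The second printed line contrasts true-false items with free response items at the same discrimination, showing the slower decline when guessing is likely.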

To give a clear example of the combined effects of
discrimination and guessing on the tailoring process, suppose
we have three item banks which, for convenience, are
referred to as I, II, and III. Assume Bank I items have
discrimination and guessing parameters of .5 and .33. Bank
II's parameters are 1.0 and .25 while Bank III has parameter
values of 2.0 and .20. These banks may be roughly
classified as unacceptable, fair, and excellent for tailored
testing purposes. Assuming that each bank has an infinite
number of items and plotting the expected standard error
of the estimate against the number of items administered,
the three curves in Figure 3 are obtained.

In Figure 3, notice that Bank I would give unacceptable
results. After 30 items the expected standard error of the
estimate is only .56 (i.e. reliability = .69, validity = .83). In
contrast an excellent item bank, such as Bank III, would
reach this level after only 3 or 4 items. The advantage of
high discrimination and low guessing probability in an item
bank is obvious.

Figure 1. Expected standard error of the estimate according to
number of items administered at five levels of item discrimination.

Figure 2. Expected standard error of the estimate according to
number of items administered at six guessing probabilities.

Figure 3. Expected standard error of the estimate for three item
banks according to number of items administered.

Up to this point we have discussed the behavior of
Bayesian tailored testing when the item bank is assumed to
be of unlimited size. The obvious question which follows is
what happens when item bank sizes are within practical
limits? To answer this question, Monte-Carlo data for 200
items were generated for each of 100 "examinees" using
Urry's (1970) "LOGIST" program. The parameters for
discrimination (1.0) and guessing (.25) were the same as for
Bank II mentioned earlier. Eight sets of 25 difficulty values
(-2.4, -2.2, ..., 0.0, ..., 2.2, 2.4) were employed.
Bayesian tailored testing was simulated with this data using
50, 75, 100, 150, and 200 items in the bank. Since
difficulty had been specified in sets of 25 values, the item
banks had 2, 3, 4, 6, and 8 items at each of the 25
difficulty levels respectively.

TABLE 1

Validity ($r_{\theta\hat{\theta}}$) Obtained With Different Size Item Banks
(Monte-Carlo Data, N = 100, a = 1.0, c = 0.25)

[Table body not recoverable from the source.]

*Expected validities calculated from equations (32) and (23) for an
imaginary bank having an infinite number of items.

For each of the five item banks and for each of the 100
examinees, tailoring was simulated until 30 items had been
"administered". As each item was "administered" the new
estimate of ability was recorded. Since the data was
randomly generated, true ability (distributed as N(0,1)) was
known and could be correlated with estimated ability.
Table 1 gives the validity (correlation between true and
estimated ability) for each item bank by the number of
items "administered". The last column in Table 1 gives the
expected validities for an item bank of infinite size as
calculated from equations (32) and (23).
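
The design of this simulation can be sketched in a few lines. The code below is only an illustration under stated assumptions: it uses a three-parameter logistic response function with the scaling constant 1.7, a posterior computed on a fixed grid of ability values, and a "nearest difficulty" item selection rule. None of these is claimed to be the Bayesian tailoring procedure used in the paper, and all function names are hypothetical.

```python
import math
import random


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def make_bank(items_per_level, a=1.0, c=0.25):
    """Bank with the same number of items at each of 25 difficulty
    levels running from -2.4 to 2.4 in steps of 0.2."""
    levels = [round(-2.4 + 0.2 * k, 1) for k in range(25)]
    return [(a, b, c) for b in levels for _ in range(items_per_level)]


GRID = [-4.0 + 0.1 * k for k in range(81)]  # ability grid (assumption)


def simulate_examinee(theta, bank, n_items, rng):
    """Administer n_items adaptively and return the posterior mean."""
    post = [math.exp(-0.5 * t * t) for t in GRID]  # N(0,1) prior
    available = list(bank)
    estimate = 0.0
    for _ in range(n_items):
        # Stand-in selection rule: difficulty closest to the estimate.
        item = min(available, key=lambda it: abs(it[1] - estimate))
        available.remove(item)
        a, b, c = item
        correct = rng.random() < p_correct(theta, a, b, c)
        post = [w * (p_correct(t, a, b, c) if correct
                     else 1.0 - p_correct(t, a, b, c))
                for t, w in zip(GRID, post)]
        total = sum(post)
        post = [w / total for w in post]
        estimate = sum(t * w for t, w in zip(GRID, post))
    return estimate


def pearson(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def validity(bank, n_examinees=100, n_items=30, seed=0):
    """Correlation between true and estimated ability for one bank."""
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]
    estimates = [simulate_examinee(t, bank, n_items, rng) for t in thetas]
    return pearson(thetas, estimates)


for per_level in (2, 8):  # banks of 50 and 200 items
    bank = make_bank(per_level)
    print(len(bank), round(validity(bank), 2))
```

Because the selection rule and estimator differ from those of the paper, the printed correlations should not be expected to reproduce Table 1; the sketch only shows how bank size enters the simulation.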

The Monte-Carlo data above represents items which are
passable but not especially good for tailored testing. To see
how item bank size would influence validity when the bank
was composed of excellent items, the Monte-Carlo data
tailoring simulation was repeated with higher discrimination
(2.0) and lower guessing (.20) parameter values. These
configurations correspond to Bank III mentioned earlier.
The results of the simulated tailoring with this new data are
given in Table 2.

TABLE 2

Validity (r_θθ̂) Obtained With Different Item Bank Sizes
(Monte-Carlo Data, N=100, a=2.0, c=.2)

                         ITEMS IN BANK
Items
Administered     50     75    100    150    200   Infinite*

      1         .66    .66    .66    .66    .66     .58
      2         .75    .75    .75    .75    .75     .74
      3         .84    .84    .84    .84    .84     .82
      4         .89    .89    .89    .89    .89     .86
      5         .92    .92    .92    .92    .92     .90
      6         .93    .93    .93    .93    .93     .91
      7         .94    .94    .94    .94    .94     .93
      8         .95    .95    .95    .95    .95     .94
      9         .96    .95    .95    .95    .95     .95
     10         .96    .96    .96    .96    .96     .96
     11         .97    .96    .96    .96    .96     .96
     12         .97    .96    .96    .96    .97     .96
     13         .97    .97    .97    .97    .97     .97
     14         .97    .97    .97    .97    .97     .97
     15         .97    .97    .98    .97    .98     .97
     16         .97    .98    .98    .98    .98     .98
     17         .97    .98    .98    .98    .98     .98
     18         .98    .98    .98    .98    .98     .98
     19         .98    .98    .98    .98    .98     .98
     20         .98    .98    .98    .98    .98     .98
     21         .98    .98    .98    .98    .98     .98
     22         .98    .98    .99    .98    .98     .98
     23         .98    .98    .99    .98    .98     .98
     24         .98    .98    .99    .98    .98     .98
     25         .98    .98    .99    .99    .99     .98
     26         .98    .98    .99    .99    .99     .99
     27         .98    .98    .99    .99    .99     .99
     28         .98    .98    .99    .99    .99     .99
     29         .98    .98    .99    .99    .99     .99
     30         .98    .98    .99    .99    .99     .99

*Expected validities calculated from equations (32) and (23) for an
imaginary bank having an infinite number of items.

For practical application it is apparent that a very large
number of items is not a critical item bank characteristic if
the bank is good in other respects. In both Table 1 and
Table 2 the Monte-Carlo data validities obtained for the five
banks closely match each other and they also parallel the
validities to be expected from a corresponding item bank of
infinite size. However, it must be remembered that this was 
Monte-Carlo data and the tailoring simulation used known 
parameter values for discrimination, difficulty, and 
guessing. With real data involving imprecise parameter
estimates and a possible non-uniform distribution of
difficulty, it would be wise to be a bit cautious if a bank
had, say, fewer than 75 items. In connection with this,
there are some practical problems which arise if an item
bank is too large. A large bank has more items available for
administration, but the storage requirements and the
increased computer processing needed for item selection
also slow things down while adding to overall computer
costs. (Some good cost-efficiency studies are needed on 
this!) 

The last item bank requirement is uniform distribution 
of difficulty. The exact results of violating this rule are
difficult to predict, since they would necessarily depend on 
the actual distribution of item difficulty, the discrimination 
and guessing parameter values, the number of items in the 
bank, and the criteria used to terminate the tailoring
process. The essential point to remember is that the
Bayesian tailoring procedure attempts to select for
administration the item which will yield the most
information. If, at a particular level of difficulty, there are
no items available, the Bayesian process will be forced to
select an item which is not appropriate and which will yield
less than an optimal amount of information.

To summarize, this paper has outlined a Bayesian
approach to item selection for tailored testing. Four basic
requirements of a good item bank for this process have
been discussed. If these requirements are met, Bayesian
tailored testing will yield excellent results. The key to the
process lies in careful construction of item banks. If
attention is given to this, the Bayesian tailoring process
gives us a fundamental tool for practical application of
latent trait mental test theory.

REFLECTIONS ON ADAPTIVE TESTING



DUNCAN N. HANSEN
Memphis State University



The purpose of this paper will be to reflect on various
aspects of the adaptive testing field. Building from our prior
Memphis State University and Air Force work in the area,
the various issues, alternatives, priorities and ultimate styles
of research for adaptive testing will be placed in the context
of empirical findings and institutional requirements. The
rationale for proposing such a pontifical and extremely
challenging task is twofold. First, all our substantive and
empirical work was recently reported (Hansen, 1975) and it
would seem superfluous to rewrite or try to extend this
research prior to more effort; therefore, only the major
questions and findings will be summarized in this paper.
Secondly, the various characteristics of the adaptive testing
field will be reflected on in terms of research productivity
and institutional requirements. Having by scholarly
necessity been forced to read extensively in this domain
over the past five years and, in many instances, to take a
pencil in hand to follow a variety of formal derivations, I
think it appropriate for me to comment about various
purposes and styles of research. This is not done to criticize
any of these models but rather to seriously address the
question, "Are we moving in the most profitable direction
and using the most expeditious procedures?" 

MSU Adaptive Testing 

Generic to any research in adaptive testing or that
relating to the whole educational enterprise is a clear
understanding of its purpose. For our group, the purpose is 
that of facilitating achievement or mastery testing. Within 
industry and military training it is common to find that 
testing time and managerial demands, especially for 
individualized techniques, are now taking upwards of 20 
percent of the total training time. Such a training 
commitment becomes sizable and the systems managers 
must inevitably ask the question, "Is there a more efficient
and effective way of going about it?" For example, the Air
Force Advanced Instructional System will ultimately have
700 students aboard for any given training shift (2,100
students per day). If one considers that their day consists of
six hours of instruction and that approximately 20 percent
of this will be given over to testing, one can see that 72
minutes are being allocated on the average for each
student's evaluation per day. If such testing time can be
reduced by 50 percent, an adaptive testing goal set for our
efforts, then effectively 1.5+ million dollars worth of
salaried money can be gained by shortening the training
time for the 2,100 manpower units flowing in this system.
It is precisely this type of monetary achievement that
impresses our representatives in Congress concerning the
importance of research ideas applied to significant
educational problems. As will be suggested later, such
specific, operational goals, while unachieved to date, are
the best rationale for continued research support in this
area.
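
The testing-time figures in this example follow from simple arithmetic, sketched below. The function and variable names are illustrative. The dollar figure quoted above depends on salary data not given in the paper, so it is not computed here.

```python
def daily_testing_minutes(instruction_hours=6, testing_percent=20):
    """Minutes per student per day given over to testing."""
    return instruction_hours * 60 * testing_percent / 100


per_student = daily_testing_minutes()       # 72.0 minutes
saved_per_student = per_student * 50 / 100  # 36.0 minutes at a 50 percent cut
students_per_day = 3 * 700                  # three shifts of 700 students
saved_hours_per_day = saved_per_student * students_per_day / 60

print(per_student, saved_per_student, saved_hours_per_day)
# 72.0 36.0 1260.0
```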

As a corollary to the efficiency issue, an accompanying
objective concerns the efficient application of computer
technology to the testing process. In essence, one can
demonstrate that adaptive testing falls closer to the drill
and practice end of the computer usage continuum
(Hansen, et al., 1973) and certainly is orders of magnitude
less demanding on a computer than CAI or simulated
training. Our experiences and computer algorithms can be
offered to you for your consideration. These document an
efficient use of computers, tools which are fast becoming 
integral to the educational processes within our human 
institutions. 

Finally, adaptive testing should be considered within the 
context of a total systems effort. For our group, adaptive
testing is just one component within an overall adaptive
instructional system. As one significantly alters the
environment and the sequence of educational elements so
as to foster or optimize learning outcomes for a given
individual, one can see that testing becomes just one more
component in such a stream of events. One should look at
it, though, in terms of its contributions to the individual
and the institution, be this increasing levels of competency
or the educational system itself. Thus, one can contend that
theoretical models have little or no value unless placed
within such a system context since it is the context which
will mold and determine the criteria, values, and operation
by which its characteristics shall be judged. Let us turn,
then, to the specifics of the MSU adaptive testing model.

MSU Adaptive Testing Model

Our adaptive testing approach involves three components,
namely, the entry of a student into the test,
tailoring the test items for the student, and adaptive scoring
procedure. Each of these will be discussed in turn. In
reference to the entry and test composition processes, a
student is entered at a level commensurate with our
prediction of his ultimate performance; therefore, using
linear regression techniques mostly composed of variables
from prior test performances, a student is placed into a
monotonically arranged test. Such a procedure seems to
work quite successfully and has an additional advantage of
reducing the number of test items to be presented for any
understanding of the flexilevel algorithm.)

While we have very limited data concerning the efficacy
of this procedure, entry to final score correlations appear to
be in the low .80 range. These are similar to correlation
coefficients reported by Geary at the University of
Wisconsin for students who were placed in a [illegible] test
according to a predicted outcome level (a personal
communication at an AERA conference in 1969). Thus, the
adaptive entry of a student seems to be a positive step
forward and should be taken into account by any model
working within this field.

In reference to test composition, it can be specified that
each student, based on his entry profile, will have a
specially developed set of composed items. These composed
items may reflect information concerning the student's
prior performance on various objectives which form the
achievement test. Therefore, if one has information about a
student's achievement of these objectives, there is no
rationale for presenting the item. It is precisely this concept
of test composition that appears so advantageous, although
it has not been empirically pursued. One can anticipate that
sometime within the next year one of the military training
systems will pursue it in greater depth.

Tailored Testing of Items

As indicated, Lord's flexilevel algorithm is utilized for
tailoring the presentation of test items. For achievement
testing, this approach violates the assumptions as to
normality as axiomatically represented within this model,
but it can empirically be countered that our findings justify
the utilization of the algorithm from a student and systems
point of view. This adaptation is precisely the ability to
move between very difficult and very easy items while at
the same time adjusting cutoff criteria where considered
appropriate (up to this point our group always used end of
test item cutoff procedures but others could be
considered). Achievement and mastery testing, especially in
a technical training environment, always tend to yield
asymmetric performance score distributions. Such
distributions, if better understood, could be more readily adapted
to flexilevel testing and yield optimal algorithms.
Obviously, no attempt to prove such an assertion has been 
made at this point. 
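
For readers unfamiliar with the flexilevel routing rule, a minimal sketch follows. It assumes the usual description of the procedure: items are ordered by difficulty, testing starts at the middle item, a correct answer leads to the next harder unused item, an incorrect answer to the next easier unused item, and testing stops after (N + 1) / 2 items. The function name and the `answer` callback are hypothetical, and the MSU cutoff adjustments mentioned above are not modeled.

```python
def flexilevel(difficulties, answer):
    """Route one examinee through a flexilevel test.

    difficulties: list of item difficulties (odd length).
    answer: callable taking an item index, returning True if correct.
    Returns a list of (item_index, correct) in order of administration.
    """
    n = len(difficulties)
    if n % 2 == 0:
        raise ValueError("flexilevel sketch expects an odd number of items")
    order = sorted(range(n), key=lambda i: difficulties[i])
    middle = n // 2
    next_harder, next_easier = middle + 1, middle - 1
    position = middle
    administered = []
    for _ in range((n + 1) // 2):
        item = order[position]
        correct = answer(item)
        administered.append((item, correct))
        if correct:
            position = next_harder   # next more difficult unused item
            next_harder += 1
        else:
            position = next_easier   # next easier unused item
            next_easier -= 1
    return administered


difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Hypothetical examinee who passes every item easier than 0.5.
print(flexilevel(difficulties, lambda i: difficulties[i] < 0.5))
# [(2, True), (3, False), (1, True)]
```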


Scoring 

Our views on scoring represent an attempt to remain
consistent with the traditional procedures of adding up all
correct responses and giving weights to those items that are
most difficult. Therefore, we have used the Green
procedure (Green, 1970), that is, an averaging of the
correct item difficulties achieved by a student. Using the
flexilevel algorithm and this scoring process, the overall
reliability and validity of the adaptive testing procedure
seems reasonably satisfactory as it yields coefficients that
vary between .6 and .8 (alpha coefficients and parallel
test coefficients).
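
The scoring rule described here, an average of the difficulties of the items answered correctly, can be sketched as follows. The function name and the handling of an examinee with no correct answers are assumptions made for illustration.

```python
def green_score(responses):
    """Average difficulty of the items answered correctly.

    responses: list of (difficulty, correct) pairs.
    Returns None when no item was answered correctly (an assumption;
    the text does not say how this case is scored).
    """
    correct_difficulties = [d for d, ok in responses if ok]
    if not correct_difficulties:
        return None
    return sum(correct_difficulties) / len(correct_difficulties)


print(green_score([(0.0, True), (1.0, False), (-1.0, True)]))  # -0.5
```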

In addition, we are making plans to contrast two
additional adaptive routines so as to resolve what we
perceive as a critical problem, namely, the critical zone
performer. In any given training situation, there is a critical
criterion zone, typically being between the 70th and 90th
percent level, which is stipulated as a requirement for the
attaining of course mastery. If a student scores close to or
within this level (consider it being bounded by the standard
error of measurement), then one should collect more
information prior to judging this student as having achieved
the objectives or in need of further remediation. At least
two approaches can be considered to resolve this problem.
The first is an obvious approach simply involving the
presentation of an additional set of items for this zone; this
is similar to a branching test. A more promising one,
especially given the role of the computer, is Bock's (1972)
procedure for item latent structure which makes use of the
information contained in wrong alternative answers. The
Bock model appears to us to be a far more preferable
procedure in terms of ongoing large-flow training situations
and it shall be evaluated during the coming year within the
AF/AIS context.
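
One reading of the critical zone rule is sketched below: scores clearly below the zone lead to remediation, scores clearly above it to mastery, and anything within one standard error of measurement of the zone triggers further testing. The zone limits, the decision labels, and the function name are assumptions for illustration only.

```python
def classify(percent_score, sem, zone=(70.0, 90.0)):
    """Three-way decision around a critical criterion zone.

    The zone is widened on both sides by the standard error of
    measurement (sem), following the bounding suggested in the text.
    """
    low, high = zone
    if percent_score < low - sem:
        return "remediation"
    if percent_score > high + sem:
        return "mastery"
    return "collect more information"


for score in (60.0, 68.0, 97.0):
    print(score, classify(score, sem=4.0))
```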

Data relating to reduction in testing time indicates that 
only approximately 31 percent of the items are utilized if
individualized entry and adaptive techniques are employed.
This yields a 150 percent savings in testing time. The 
samples, unfortunately, were extremely small and our group
looks forward to a much more extensive validation study in
the AIS military training situation. Similar savings are
reported by Tam (1973) in his study of affective adaptive
testing although modest ones were reported by Hedl (1971)
in his intelligence testing. All in all, the results are
sufficiently promising to extend the validation for these
approaches as well as explore alternative designs within
realistic training situations. These alternatives form the 
substance of the remainder of the paper. 



Issues in Adaptive Testing 

As an active reader and investigator in the adaptive 
testing area over the last eight years, one general
observation comes to mind, namely, a classical
psychometric approach emphasizing those cherished characteristics
of excellence, improved reliability, validity, and
consequential individual description, is limited in its systems and
institutional view. In essence, our efforts have been to
describe each and every individual in reliable, fine-grain
terms while recognizing the needs to improve the testing
system. Given these broader insights, the purpose of this
section will be to raise issues and possible alternatives as
reflected by priorities concerning objectives for adaptive
testing. There are three areas to be considered as reflected
by these queries: (1) What are the possible purposes for
adaptive testing? (2) What types of formal models might
best be pursued for adaptive testing? and (3) How can our
theoretical and procedural methods best be evaluated?

Purposes for Adaptive Testing

The tradition within psychometric research as well as
test development has focused on descriptions and decisions
concerning individuals. On the other hand, many
institutions believe group differences in the testing process
should be stressed since it is group data that form the basis
of decision making. For example, in the current
controversy concerning the contributions of school and
curriculum effects, Rakow (1974) argued that tests have
been constructed to maximize on individual discriminations
and to minimize group differences. Therefore it is not
surprising that one finds no statistically significant group
effects for schools or curriculums; the Coleman study
(1968) or the Jencks follow-on study (1972) represent this
type of outcome. Rakow argues that if one utilizes
inter-class correlational techniques, one can find highly
significant relationships of a subset of items which
distinguish among groups. For adaptive tests that attempt
to support large human organizations such as military
training, this implies that classifying an individual
concerning group membership and the characteristics of
this group is of a high priority. This adaptive testing
approach would utilize a branching item technique so as to
lead to reliable alternative group classifications for an
individual. Having achieved this, then the more
conventional individual discrimination techniques could be
applied. Obviously, the utilization of a flexilevel algorithm
based on appropriate individual placement would be
preferable. The point of such a two-stage model is to
provide for more effective adaptation for group placement
and ultimately for maximizing on institutional criteria
rather than individual criteria alone. Simply, might it be
better to find the correct group for an individual rather
than know his "true score" on some ability dimension?

In turn, one can look at training systems and recognize 
that there is a trade-off between training load vs. standard
error effects. In essence, as the training load absorbs more
and more of the readily available resources, an
improvement in the testing process with an associated reduction in
standard error is superfluous since all the remaining
individuals will have the same minimal treatment. In
essence, each student is likely to spend long waiting times
and not be able to pursue any kind of optimum course of
instruction. Under such circumstances, it is therefore
critically important to identify those individuals who can
pursue self-study where appropriate. Moreover, it might
also be highly important to have adaptive tests that better
detect those individuals who seem to have aptitudes for
transfer, so that when branched forward or back for review
within a normal sequence of instruction, they will receive
facilitating effects rather than negative ones.



In turn, as the training load on resources diminishes, one
should expect the test length to increase so as to reduce
errors of measurement. Thus, one can see that a systems
approach to adaptive testing tends to reflect a far more
dynamic procedure which might change the criteria, the
test length, and the algorithms depending on the state of
the training system.

Finally, to be optimally adaptive, one should recognize 
that our clientele and their institution basically do not
understand the concepts, methods, or models of adaptive
testing. To them, the quantification, especially as
represented by our psychometric models, tends to defy
understanding. Allow me to illustrate. MSU has been
teaching a measurement course on base at NAS, Memphis.
Two of the students were commanding officers of Navy
technical training schools and have direct responsibility for
supervising the measurement processes within these schools.
After completing an eight-week course, each volunteered
that they had, prior to the course, never understood any of
the quantitative test item statistics or reports other than
those concerning students passing or failing, the
all-important attrition rate. To be adaptive the system should
provide the commanding officers, instructors, students, and
other concerned people with verbal reports rather than
quantitative reports; thus, a client-oriented product
approach would vastly enhance the acceptance of adaptive
testing. The work of Fowler (1969) with the MMPI
successfully demonstrates that psychiatrists readily desire
and understand verbal interpretations rather than
quantitative reports of the 13 MMPI subscales. These observations
about institutional effects hopefully will stimulate your
interest in thinking about your clientele as well as your
model when you formulate some of your priorities for
future research. As cited in the introduction, adaptive
testing research must be scholarly, diligent, and of the
highest quality while reflecting a form of institutional
adaptation which can be appreciated and supported by the
clientele who provide the resource support for all research.

Psychometric Models for Adaptive Testing 

Within the tradition of adaptive testing research, one 
reads numerous reports that focus on the comparative 
merits of alternative psychometric models for adaptive
testing. It shall be the thesis of this section that pursuit of
an optimal adaptive testing model is likely to be ineffective
and the adaptive testing domain needs a strategy for
identifying selection criteria that choose among the many
existing models. Optimization studies, especially from a
formal point of view, have been pursued for the last 30
years in different contexts with surprisingly similar
indifferent results. For example, during the 1940's many
statisticians pursued within analysis of variance models the
issue of optimal a posteriori mean difference tests. After
better than a decade and a half of effort, John Tukey
(1962) observed that one could not really argue for the one
best a posteriori test because each varies according to the
decision criteria of the investigator. In essence, it is the
characteristic of the research which determines which one
of the many tests is the most appropriate.

In turn, the area of mathematical learning models offers
a similar finding. Within the context of research on the
all-or-none vs. incremental learning processes during the
early 1960's, one notes a flurry of research, all of which
ended with the conclusion (Atkinson, et al., 1965) that
each mathematical learning model has a set of task
characteristics which allows it to be optimal provided that
the a priori task characteristics are sufficiently matched.

Recently a great deal of effort has gone into the 
investigation of adaptive instructional models from an
optimization point of view. Generalized approaches include
various regression models. While these regression models are
clearly non-optimal, they have proven significantly
successful in facilitating the process. On the other hand,
fairly specific models, be these Markoff processes or
dynamic programming structures, provide an elegant
theoretical explanation (Hansen, et al., 1973) but rarely fit
the data or facilitate learning. Thus, one is led to the view
that an array of models for the instructional area will be
necessary in order to fit the rather diverse nature of the 
learning process. 

Based on these examples, the proliferation of
psychometric models for adaptive testing is likely to have limited
productivity. Our efforts to focus on the criteria to be used 
for the selection of a given adaptive testing model and a 
better descnption of how to test the model's fit with the 
given behavioral phenomena would seem to be a more 
desirable direction in which to move. 

Validation Procedures 

As has been observed by each of the reviewers in this
area, the amount of empirical work is modest at best. If one
considers critical topics, namely, sample size and design
techniques, one is even further impressed by our modest
beginnings. For example, in reference to sample size there
are those such as Bock (personal communication) who
would advocate that at least for his latent item structure
model, a sample size of 2,000 students would be required.
While pursuing some of the test data for the Air Force with
a sample of 1,000 plus airmen, the groups were divided into
samples of 200 each and then the usual reliability and
validity analysis was performed. In addition, each sample
was progressively aggregated into the next and it is fairly clear
that the parameter convergence process was still taking
place after the sample size had increased to 800. Therefore,
it can be argued that it is important to consider economizing
on sample size and to develop techniques by which both
item and test parameters converge on their appropriate
group and individual values.
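
The progressive aggregation described above can be sketched for a single item statistic. The example assumes the statistic of interest is the proportion of correct responses to one item and that blocks of 200 examinees are added cumulatively; the paper does not say which parameters were tracked, and the function name is hypothetical.

```python
def cumulative_estimates(responses, block_size=200):
    """Proportion correct after each successive block is aggregated.

    responses: list of 0/1 item scores in order of data collection.
    Returns one estimate per cumulative sample (200, 400, 600, ...).
    """
    estimates = []
    for end in range(block_size, len(responses) + 1, block_size):
        estimates.append(sum(responses[:end]) / end)
    return estimates


# Hypothetical data: an easy first block followed by a harder one.
scores = [1] * 200 + [0] * 200
print(cumulative_estimates(scores))  # [1.0, 0.5]
```

Plotting such estimates against cumulative sample size is one simple way to judge whether convergence has been reached.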

In turn, our review of the designs for validation is
consistent with that proposed by Tam (1973), namely, that
one has to consider a within-test as well as a between-test
validation procedure. This can be achieved simultaneously
if one notes that one can present adaptive testing as a
variation within total test procedure. In turn, this can be
contrasted with a parallel form presentation. The two
statistics, correlation between the two adaptive and total
test scores and the correlations between the two parallel
forms, yield a comprehensive representation of the validity.
While this may seem excessive to some, such validation
procedures provide more substantial empirical results which
clearly indicate the justification for reducing total test
items.

Summary 

This review and reflection has run on in a rather
extensive manner. Furthermore, it seems inappropriate to
have reflections on reflections. Therefore, this summary
will state a final point of view, namely, adaptive testing is
sufficiently dynamic that multiple concepts and hypotheses
can be incorporated in a design sequentially so as to
determine their effect on the efficiency and effectiveness of
the assessment process. This extensive review of a number
of neglected topics should not be taken as a set of
imperatives for research. Rather, these topics and
suggestions can best be considered as potential variations
within experimental designs of the future. They are offered
to you under the assumption of colleague productivity and a
firm commitment to the human and sodetal benefits from 
adaptive testing. Of all the evaluational techniques available 
to us at this time, adaptive testing offers that chance to 
humanize our assessment processes. Such an eventuality, 
especially in terms of shortening high-stress situations 
commonly found in testing, cannot be minimized in terms
of its benefits.

COMPUTER ASSISTED TESTING: AN ORDERLY TRANSITION
FROM THEORY TO PRACTICE



The United States Civil Service Commission is
responsible for examining applicants for Federal jobs
throughout the world. It examines almost two million
persons and makes about 200,000 placements annually.

The Commission's investment in computerized adaptive
testing research and development is a significant one. This
exciting and innovative program is currently budgeted at
almost $200,000 per year. This expenditure comes at a
time when Federal agencies' budgets are most austere and
when resources are sorely needed to respond to the
increasing challenges faced by conventional examining
methods. 

The Commission's investment in computerized adaptive
testing is based primarily on the potential payoff in
improved employee selection and placement. The large
numbers of examinations and applicants makes
computerized adaptive testing an economical, practical vehicle
for improved measurement. The answer to attacks on tests
in the employment situation is complex; the economic and
social implications of this problem are enormous.
Unquestionably, however, the greatest benefit, both to the
employer and to the employee, lies in better measurement,
not in less measurement. Every improvement in the
selection and placement processes should contribute to the
economic health of the employer, the psychological well
being of the affected individual, and the welfare of society.
Computer technology offers not only an opportunity to
make significant improvements in employment decisions
but also a better means of assessing the effects of such 
improvements. 

While there are problems yet to be solved, computerized 
adaptive testing is well on the way to implementation. 

As conventional approaches to test construction are 
modified in light of developments in latent trait theory,
computerized adaptive testing becomes more and more
feasible. The Rasch Model showed capabilities for
computerized adaptive testing in the special case where all
items discriminated equally and were unaffected by
guessing. This special case was simply not practical to
expect in available test items (Urry, 1970). Since item
requirements for three parameter logistic or normal ogive
models can be met with existing items (Lord, 1970),
computerized adaptive testing can be implemented. The
implementation can be cost effective (i.e., the number of
test items administered is substantially reduced vis-a-vis
conventional testing) when certain rigorous item bank
specifications can be met (Jensema, 1975). The
determination that the item bank specifications can be met with
existing items is contingent upon a new look at
conventional item statistics and their relationship to model
parameters. It has become apparent that the distortions
caused by guessing result in severe underestimates,
particularly of item discriminatory powers (Urry, 1975).
Reliable estimates of parameters can now be made (Gugel,
et al., 1975). An algorithm exists that will allow on-line
computer-interactive item calibration (Schmidt & Urry,
1975).

Problems remain in tailoring test batteries to specific 
occupational requirements and in adequate coverage of 
job-related abilities. Of serious concern are the time and
dollar resources that are needed for comprehensive
measurement. The improved medium of presentation 
inherent in the hardware will facilitate resolution of these 
problems; for example, new item types and audio input 
possibilities. 

Application of computerized adaptive testing in civil
service examining has several desirable features.

Job Relatedness. With multivariate test item banks, it is
feasible to interpret scores on specific abilities in terms of
differential occupational requirements. This then enables
the employer to test a large number of abilities and to
weight these abilities in accordance with their importance
for success in specific jobs. The employer can array
applicants across a large number of jobs and select in terms
of priority, thus maximizing the utility of the selection
process.

Standardized Examination Administration. Individual
differences among administrators under conventional
testing make error variance due to unstandardized
administration largely unavoidable. Since administration
procedures can be programmed under individualized
testing, standard conditions can be better maintained.

Compromise of Examination Materials. Under computerized
adaptive testing, examination questions are
located in a central computer. No test booklets are used;
therefore none can be taken from the examination room.
As a result, the security of tests and test questions can be
maintained more easily. Different individuals will receive
different sequences of items, reducing the likelihood of
cheating.

Improved Administrative Procedures. Test booklet
printing, storage, and distribution costs become
inconsequential.

Examination Scheduling. Tests can be administered on a
walk-in basis since different tests can be administered
simultaneously. The shortened testing time makes possible
the administration of a multiple abilities battery in the time
now required to examine for a single ability. Further, if
selection is specific to a given position, individualized
testing for the required abilities can be accomplished in a
manner that minimizes the time of testing while
maximizing the job relatedness of a final weighted score.

Power Conditions of Examination. Tests of ability
should be power tests. However, due to administrative
considerations, i.e., scheduling, space restrictions, etc.,
conventional tests of ability are usually speeded to a certain
degree. Under computerized adaptive testing, the power
conditions required by this type of test can be ensured.

Test-Taking Motivation. Test-taking motivation and,
consequently, test performance may be impaired when the
level of difficulty of the examination material is
inappropriate to the level of ability of the examinee. In
conventional testing, the examination is constructed for an
entire population. This method of construction necessarily
leads to inappropriate question difficulties when a
conventional test is presented to a given examinee. In
computerized adaptive testing, the difficulty level of the
questions is matched to the level of ability of the examinee.

Improving Examinations. The current conventional
testing technology is the product of more than fifty years
of research and development. Substantial improvements
have been less frequent with the passage of time. This calls
for a rather dramatic change in testing procedure. At
present, the appropriate change would be towards an
individualized testing technology. Certainly greater
experimental control and a thorough monitoring of the
measurement process are made possible through the aid of
this new medium.

Improving Personnel Decisions. When a computer
interactive network has been established for individualized
testing, one has necessarily established a vast data accession
network to effect immediate evaluation of the personnel
decision-making process. Optimization in the decision-
making process is the natural extension of events when
many sources of information are available to a central
computer and are readily accessible for analysis by the
personnel researcher and personnel specialist.



It appears, at this time, that computerized adaptive
testing research has progressed to the point where
implementation will be feasible. In Fiscal Year 1976, a
comprehensive cost analysis will be undertaken. Preliminary
estimates are favorable. For example, computer connect
time in testing in one ability area now costs less than forty
cents per examinee. It is reasonable to expect that cost to
drop as the program progresses. Current plans call for fully
operational computerized adaptive testing by 1980. At that
time, it is expected that the examination for most
entry-level professional and administrative jobs will include
a test battery administered in the computerized adaptive
system. Approximately 200,000 applicants currently file
for these jobs. It will take until 1980 to get ready for an
examination of this scope and number of participants.

My colleagues this morning will address some of the
progress we have made in solving technical problems
associated with the program.

A FIVE-YEAR QUEST:

IS COMPUTERIZED ADAPTIVE TESTING FEASIBLE?

VERN W. URRY

U.S. Civil Service Commission



Five years of research on the feasibility of computer-
assisted testing has attempted to answer four extremely
significant questions: (1) What types of items are required
for effective computerized adaptive testing? (2) Do these
types of items exist in sufficient number to measure
important abilities adequately? (3) Can estimates of the
item parameters be obtained that are sufficiently reliable to
be used successfully in a computerized adaptive testing
algorithm? and (4) Is there an efficient and accurate
adaptive algorithm for computerized testing?

In answer to the first question, "What types of items are
required for effective computerized adaptive testing?", the
development of specifications for effective item banks or
item pools for computerized adaptive testing was begun
about five years ago (Urry, 1970). These specifications were
written with reference to the three parameters of the
normal ogive model (Lord & Novick, 1968) and the logistic
model (Birnbaum, 1968). At that time, they included
requirements for a minimum of 100 items with item
discriminatory powers (the $a_i$) of at least .80, with item
difficulties (the $b_i$) evenly distributed on the interval from
-2.00 to 2.00, and with item coefficients of guessing (the
$c_i$) of .25 as a maximum. Some research was later
completed (Jensema, 1974; Urry, 1974b) indicating that
the maximum value for the $c_i$ could be set as high as .30
with item bank effectiveness still maintained.
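
The bank specifications just listed can be written down as a simple screening check. The following is only a minimal Python sketch and not a procedure from the paper: the function name `meets_bank_specs`, the `(a, b, c)` tuple layout, and the `max_gap` rule used to approximate "evenly distributed" difficulties are all assumptions introduced for illustration.

```python
def meets_bank_specs(items, min_items=100, min_a=0.80,
                     b_range=(-2.0, 2.0), max_c=0.25, max_gap=0.5):
    """Return True if a bank of (a, b, c) tuples satisfies the stated limits.

    max_gap is a hypothetical stand-in for "evenly distributed": no stretch
    of the difficulty interval longer than max_gap may be left uncovered.
    """
    if len(items) < min_items:
        return False
    if any(a < min_a or c > max_c for a, _, c in items):
        return False
    lo, hi = b_range
    inside = sorted(b for _, b, _ in items if lo <= b <= hi)
    edges = [lo] + inside + [hi]
    return all(right - left <= max_gap
               for left, right in zip(edges, edges[1:]))


# Hypothetical bank: 100 items, difficulties spread evenly over [-2, 2].
bank = [(1.0, -2.0 + 4.0 * i / 99, 0.28) for i in range(100)]
print(meets_bank_specs(bank))              # c = .28 exceeds the .25 limit
print(meets_bank_specs(bank, max_c=0.30))  # passes under the relaxed .30 limit
```

Passing `max_c=0.30` corresponds to the relaxed guessing limit reported in the later research; the other defaults follow the original specifications.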

In these studies, an item bank was adjudged effective
when computerized adaptive testing required fewer items
than conventional paper and pencil testing to attain the
same level of reliability. The specifications were arrived at
through model sampling and simulation techniques. The
concern was the capability of the 3-parameter models for
the specific purpose of computerized adaptive testing. After
model capabilities were adequately explored, there
remained the empirical question, "Do these types of items
exist in sufficient number to measure important abilities
adequately?"

At first glance, it might have appeared that the
requirement for item discriminatory powers of .8 or greater
was unreasonably high given the usual test item, because an
item discriminatory power of .8 corresponds to a biserial
correlation of .62 between the item and latent ability. In
the experience of most psychometricians this would seem
an impossible specification to meet, because the usual
item-test biserial correlations tend to be much lower than
this specified value. However, the impossibility exists only
if the attenuating effects of guessing on conventional
indicants of item discriminatory power are not fully
understood. These effects mask the true discriminatory
power of multiple-choice items to a marked degree, and
they are still largely unappreciated.
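
The correspondence between a discriminatory power of .8 and a biserial of .62 follows from the normal ogive relation between the two quantities, rho = a / sqrt(1 + a^2). A minimal Python sketch of that conversion is given here; the function name `biserial_from_a` is a label chosen for illustration.

```python
from math import sqrt


def biserial_from_a(a):
    """Item-ability biserial implied by discriminatory power a (normal ogive)."""
    return a / sqrt(1.0 + a * a)


for a in (0.8, 1.0, 2.0, 3.0):
    print(a, round(biserial_from_a(a), 2))
```

For a = .8 the function returns about .62, which is the value quoted in the text.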

In order to illustrate these effects, equations were
derived for the point-biserial (Urry, 1974a) and the biserial
(Urry, 1975) correlations between multiple-choice items
and latent ability. The equation for the point-biserial
correlation was derived as

    $$\rho'_{pbis} = \frac{(1 - c_i)\,\rho_{i\theta}\,\phi(\gamma_i)}{\sqrt{P'_i Q'_i}}$$    (Urry, 1974a, eq. 15);  (1)

and the derivation of the biserial correlation resulted in

    $$\rho'_{i\theta} = \frac{(1 - c_i)\,\rho_{i\theta}\,\phi(\gamma_i)}{\phi(\gamma'_i)}$$    (Urry, 1975, eq. 6).  (2)

In these equations, a prime was used to indicate that the
given term was affected by guessing. Definitions were as
follows:

    $c_i$, the item coefficient of guessing, is the lower
        asymptote of the regression of the binary item
        on latent ability;
    $\rho_{i\theta}$ is the biserial correlation, unaffected by
        guessing, between the binary item and latent
        ability;
    $\gamma_i$ is the baseline value of the item distribution
        $N(0,1)$ above which the probability of (or
        proportion) knowing the correct response
        occurs;
    $\phi(\gamma_i)$ is the height of the ordinate at $\gamma_i$;
    $P'_i$ is the probability of (or proportion) passing a
        multiple-choice item;
    $Q'_i$, or $1 - P'_i$, is the probability of (or proportion)
        missing a multiple-choice item;
    $\gamma'_i$ is the baseline value on the distribution $N(0,1)$
        above which the probability of (or proportion)
        passing, viz. $P'_i$, occurs;
    $\phi(\gamma'_i)$ is the height of the ordinate at $\gamma'_i$.
The difference between the probability of (or proportion)
knowing the correct response to an item, viz.,

    $$P_i = \frac{1}{\sqrt{2\pi}} \int_{\gamma_i}^{\infty} \exp\left(-\frac{t^2}{2}\right) dt,$$    (3)

and the probability of (or proportion) passing a
multiple-choice item, viz.,

    $$P'_i = c_i + (1 - c_i)\,P_i,$$    (4)

is to be duly noted. As a consequence, it is known that $\gamma'_i$ is
equal to $\gamma_i$ only when $c_i$ is zero. When guessing is effective
(or, synonymously, $c_i$ is not zero), neither $\gamma_i$ and $\gamma'_i$ nor
$\phi(\gamma_i)$ and $\phi(\gamma'_i)$ are equal. Further, when guessing is
effective, $\gamma'_i$, as a baseline value, is unlike $\gamma_i$, which divides
the item distribution meaningfully on the basis of success
on the item. Notice that for $c_i$ equal to zero, equation (2)
indicates the equality of $\rho'_{i\theta}$ and $\rho_{i\theta}$. Otherwise the
distinction between these two coefficients is to be kept
clearly in mind. Since item discriminatory power is defined
by the normal ogive model as

    $$a_i = \frac{\rho_{i\theta}}{\sqrt{1 - \rho_{i\theta}^2}},$$    (5)

it is totally inappropriate to substitute estimates of $\rho'_{i\theta}$ for
$\rho_{i\theta}$ in equation (5) to estimate $a_i$. When guessing is
effective or when the items are of a multiple-choice variety,
this procedural error adversely affects computerized
adaptive testing.
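
The size of the attenuation can be checked numerically. The sketch below is a minimal Python illustration and not the author's program. It assumes that equations (1) and (2) take the forms (1 - c) rho phi(gamma) / sqrt(P'Q') and (1 - c) rho phi(gamma) / phi(gamma'), and that item difficulty is related to the baseline value by b = gamma / rho under the normal ogive model. The function name `attenuated_correlations` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution N(0, 1)


def attenuated_correlations(a, b, c):
    """Return (P', point-biserial, biserial) as affected by guessing."""
    rho = a / sqrt(1.0 + a * a)              # biserial unaffected by guessing
    gamma = rho * b                          # assumption: b = gamma / rho
    p_know = 1.0 - N01.cdf(gamma)            # proportion knowing the answer
    p_pass = c + (1.0 - c) * p_know          # proportion passing the item
    gamma_prime = N01.inv_cdf(1.0 - p_pass)  # baseline value for passing
    numerator = (1.0 - c) * rho * N01.pdf(gamma)
    r_pbis = numerator / sqrt(p_pass * (1.0 - p_pass))
    r_bis = numerator / N01.pdf(gamma_prime)
    return p_pass, r_pbis, r_bis


p_pass, r_pbis, r_bis = attenuated_correlations(a=3.0, b=2.0, c=0.20)
print(round(p_pass, 2), round(r_pbis, 2), round(r_bis, 2))
```

Under these assumptions a difficult, highly discriminating item (a = 3.0, b = 2.0, c = .20) yields a p-value of about .22, a point-biserial near .12, and a biserial of about .17, even though its unattenuated item-ability biserial is about .95. With c set to zero the function returns the unattenuated biserial unchanged.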

The derived equations for the point-biserial and biserial
correlations were used to illustrate the attenuating effects
of guessing on these conventional indicants of item
discriminatory power. In the procedure, the item
coefficient of guessing is usually set at some meaningful
value, say, the reciprocal of the number of alternatives for a
multiple-choice question; and for this fixed value of $c_i$, the
equations are evaluated to map the levels of $a_i$ and $b_i$ onto
the planes defined by the coordinates, the point-biserial
correlation and the p-value, or the biserial correlation and
the p-value. In Figure 1, the levels of a, viz., .8, 1.0, 1.2,
1.4, 1.6, 2.0, and 3.0, and the levels of b, viz., 2.0, 1.6, ...,
-2.00, have been mapped onto the plane defined by the
population point-biserial correlation and the population
proportion passing or p-value for c equal to .20. When c is
fixed at .20, the effectiveness of guessing is roughly
equivalent to the level typical of 5-alternative items. Since
the biserial correlation (unaffected by guessing) between
the item and latent ability is defined as

    $$\rho_{i\theta} = \frac{a_i}{\sqrt{1 + a_i^2}}$$    (6)

in the normal ogive model, the levels of a portrayed in
Figure 1, viz., .8, 1.0, 1.2, 1.4, 1.6, 2.0 and 3.0, correspond
to item-ability biserials of .62, .71, .77, .81, .85, .89, and
.95. Notice then the apparent paradox. For example, an
item which has an item-test point-biserial correlation of .11
with a p-value of .22 is indicated to have an item
discriminatory power, $a_i$, of 3.00 or a $\rho_{i\theta}$ of .95. The
astonishing paradox is due to the attenuating effect of
guessing. In Figure 2, identical levels of a and b have been
mapped onto the plane defined by the population biserial
correlation and the population proportion passing or
p-value, again, for c fixed at .20. While the attenuating
effect is less pronounced for the biserial correlation relative
to the point-biserial correlation, it is most severe for
difficult items. For example, a five-alternative multiple-
choice item with an item-test biserial correlation of .17 and
a p-value of .22 is indicative of an item discriminatory
power of 3.00 or an item-ability biserial of .95 and an item
difficulty of 2.00. What would happen if the procedural

Figure 1. Relationship between conventional and normal ogive item
          parameters when the coefficient of guessing (c) equals
          .20.

Figure 2. Relationship between conventional and normal ogive item
          parameters when the coefficient of guessing (c) equals
          .20.

error alluded to earlier were committed in connection with
this interesting case? It will be recalled that the error
involved the misuse of $\rho'_{i\theta}$ in equation (5). In this
instance, $a_i$ would have been erroneously estimated as .17
when the true value was 3.00. Obviously, gross errors of
this nature render computerized adaptive testing less
efficient than it would normally be. If the data point
defined by the item-test point-biserial or biserial correlation
and the p-value is plotted on one of these maps or charts,
the corresponding values of $a_i$ and $b_i$ for the given item can
be interpolated from the grid system that identifies the
various levels of $a_i$ and $b_i$. For reliable total tests¹ and large
samples, the interpolated values of $a_i$ and $b_i$ approximate
the true parameters and allow the researcher (1) to identify
items appropriate for the purpose of computerized adaptive
testing and (2) to assess the efficacy of a given set of
appropriate items for the purpose of computerized adaptive
testing by comparing the obtained interpolated values with
the specifications for item bank effectiveness. When the
specifications are met, improved reliability per item used is
assured for computerized adaptive tests relative to
conventional tests. However, the number of items required
in computerized adaptive testing relative to conventional
testing can be markedly reduced when the $a_i$ appreciably
exceed the minimum value of .80, the $b_i$ are widely and
evenly distributed, and the $c_i$ are maintained at low values.

¹As total test reliability decreases, the approximations for the
parameters $a_i$ systematically underestimate the true values of $a_i$.
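
The procedural error and its remedy can be contrasted in a few lines. The Python sketch below is an illustration only. `naive_a` applies equation (5) directly to a guessing-affected biserial, which is the misuse described in the text. `recover_a_b` is not the chart interpolation the text describes; it is an algebraic stand-in that solves equation (2) for the unattenuated biserial, assuming equation (2) has the form (1 - c) rho phi(gamma) / phi(gamma') and that b = gamma / rho. Both function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution N(0, 1)


def naive_a(r_bis_observed):
    """Equation (5) applied, incorrectly, to the guessing-affected biserial."""
    return r_bis_observed / sqrt(1.0 - r_bis_observed ** 2)


def recover_a_b(r_bis_observed, p_pass, c):
    """Undo the attenuation for a fixed guessing coefficient c."""
    p_know = (p_pass - c) / (1.0 - c)
    gamma = N01.inv_cdf(1.0 - p_know)
    gamma_prime = N01.inv_cdf(1.0 - p_pass)
    rho = (r_bis_observed * N01.pdf(gamma_prime)
           / ((1.0 - c) * N01.pdf(gamma)))
    if not 0.0 < rho < 1.0:
        raise ValueError("inputs are inconsistent with the model for this c")
    return rho / sqrt(1.0 - rho * rho), gamma / rho


print(round(naive_a(0.17), 2))  # the erroneous estimate discussed in the text
```

The inversion is sensitive to rounding for difficult items. Statistics rounded to two places, such as a biserial of .17 with a p-value of .22 at c = .20, can imply a value of rho above 1 and are rejected by the function, so unrounded sample statistics are needed in practice.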

Experience has shown (Jensema, 1972; Urry, 1974a)
that roughly one-third of the items in the usual aptitude or
ability test survive this screening for appropriateness.
Moreover, item discriminatory powers have been frequently
found to exceed 2.0 in value.

After it was ascertained that sets of items could be
found that would satisfy the specifications for effective
item banks, there remained the important question, "Can
estimates of the item parameters be obtained that are
sufficiently reliable to be used successfully in a
computerized adaptive testing algorithm?" In answer to this
question, a relatively rapid and inexpensive item-analytic
procedure was developed (Urry, in press-a). It has been
programmed and is currently available for use on several
computers. The output of the program is an item analysis
yielding ancillary estimates for $\hat{a}_i$, item discriminatory
power; $\hat{b}_i$, item difficulty; and $\hat{c}_i$, item coefficient of
guessing.

Estimates of the parameters $a_i$, $b_i$, and $c_i$ are obtained by
an iterative, minimum $\chi$-square procedure. The procedure
consists of two stages that differ only with respect to the
particular measure used for manifest ability. In the first
stage, the distribution of manifest ability is represented by
corrected raw scores where the item being parameterized is
omitted from the scoring. In the second stage, the
distribution of manifest ability is represented by Bayesian
modal estimates of ability (Samejima, 1969). Generally,
Bayesian modal estimates of ability more closely
approximate the distribution of latent ability than does the
distribution of corrected raw scores. Therefore, the second
stage constitutes a refinement on the first stage. In both
stages the procedure iterates item by item through values of
$c_i$ to obtain pairs of $a_i$ and $b_i$ consistent with large sample
estimates of the item-manifest ability point-biserial
correlation and the item p-value. This allows the generation
of various item characteristic curves (ICC's). The ICC's are
then compared with the regression of the binary item on
manifest ability. The ICC that best fits this regression, as
indicated by the minimum $\chi$-square, is given by the set of
approximations $\tilde{a}_i$, $\tilde{b}_i$, and $\tilde{c}_i$. The approximations are
then corrected for characteristics of the particular sample
of items being parameterized to obtain "ancillary
estimates" $\hat{a}_i$, $\hat{b}_i$, and $\hat{c}_i$. Ancillary estimation as a generic
method was developed by Fisher (1950). The ancillary
corrections improve the efficiency of the estimates.
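
The search over values of $c_i$ can be sketched as follows. This is a simplified Python illustration of the idea, not the program referred to in the text. It treats the ability groups as exact, omits the corrected raw score and Bayesian modal stages and the ancillary corrections, and assumes the attenuation relation r_pbis = (1 - c) rho phi(gamma) / sqrt(P'Q') with b = gamma / rho. All function names and the grouped-data layout `(theta, n, observed proportion)` are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution N(0, 1)


def icc(theta, a, b, c):
    """Three-parameter normal ogive item characteristic curve."""
    return c + (1.0 - c) * N01.cdf(a * (theta - b))


def pair_for_c(r_pbis, p_value, c):
    """(a, b) implied by the point-biserial and p-value for a trial c.

    Returns None when the trial c is inconsistent with the statistics.
    """
    p_know = (p_value - c) / (1.0 - c)
    if not 0.0 < p_know < 1.0:
        return None
    gamma = N01.inv_cdf(1.0 - p_know)
    rho = (r_pbis * sqrt(p_value * (1.0 - p_value))
           / ((1.0 - c) * N01.pdf(gamma)))
    if not 0.0 < rho < 1.0:
        return None
    return rho / sqrt(1.0 - rho * rho), gamma / rho


def chi_square(groups, a, b, c):
    """Chi-square distance between the ICC and the observed regression."""
    total = 0.0
    for theta, n, observed in groups:
        # Clamp to keep the denominator away from zero.
        expected = min(max(icc(theta, a, b, c), 1e-9), 1.0 - 1e-9)
        total += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return total


def fit_item(groups, r_pbis, p_value, c_grid):
    """Return (chi-square, a, b, c) for the best-fitting trial c."""
    best = None
    for c in c_grid:
        pair = pair_for_c(r_pbis, p_value, c)
        if pair is None:
            continue
        fit = chi_square(groups, *pair, c)
        if best is None or fit < best[0]:
            best = (fit, pair[0], pair[1], c)
    return best
```

A call such as `fit_item(groups, r_pbis, p_value, [i / 100 for i in range(31)])` scans guessing coefficients from .00 to .30 and keeps the triple whose curve lies closest to the observed proportions.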

The procedure has been evaluated through model
sampling and simulation techniques. In particular, two
parameterization samples, one of 2,000 and one of 3,000
cases, were generated from the logistic model using
specified, and hence known, item parameters. The data
were then analyzed by the procedure, and the resulting
estimates were compared to the known parameters for each
of the samples. Specifically, root mean square errors
(RMSE's), i.e.,

    $$\sqrt{\frac{\sum_i (\hat{a}_i - a_i)^2}{n}}, \quad \sqrt{\frac{\sum_i (\hat{b}_i - b_i)^2}{n}}, \quad \text{and} \quad \sqrt{\frac{\sum_i (\hat{c}_i - c_i)^2}{n}},$$

were obtained. These measures of
deviation are given in Table 1 for the two parameterization
samples and stages. Notice that the particular RMSE
indicated by a given equation tends to decrease with stages.
This is an indication of improved efficiency
due to ancillary corrections. For the final stage ancillary
estimates, these deviation measures were .242, .123 and
.056 for the 2,000 case sample, and .228, .148, and .056 for
the 3,000 case sample. For 100-item parameterization tests,
these data indicated that 2,000 cases were sufficient for the
effective use of the procedure. Correlations were also
computed between the estimates and the known
parameters, i.e., $r_{\hat{a}a}$, $r_{\hat{b}b}$, and $r_{\hat{c}c}$. These correlations are
provided in Table 2 for the two parameterization samples
and stages. Notice that there is a tendency for each
correlation to increase with stages as predicted given that
the ancillary corrections improve efficiency of estimation.
For the final stage ancillary estimates, the correlations were
.915, .996, and .764 for the 2,000 case sample, and .918,
.997, and .760 for the 3,000 case sample. Since the ranges
of the $a_i$ and $c_i$ were somewhat restricted, these correlations
are very respectable. The results of these comparisons
between the estimates and the known parameters indicated
the merit of the item-analytic procedure.
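
The two indices used in this evaluation, the root mean square error and the Pearson correlation between estimated and known parameters, can be computed as in the following minimal Python sketch. The function names `rmse` and `pearson` and the small example vectors are hypothetical and are not data from the study.

```python
from math import sqrt


def rmse(estimates, known):
    """Root mean square error of estimates about the known parameters."""
    n = len(known)
    return sqrt(sum((e - k) ** 2 for e, k in zip(estimates, known)) / n)


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical known and estimated discriminatory powers for four items.
known_a = [0.8, 1.2, 1.6, 2.0]
estimated_a = [0.9, 1.1, 1.7, 1.8]
print(round(rmse(estimated_a, known_a), 3),
      round(pearson(estimated_a, known_a), 3))
```

The same two functions apply unchanged to the difficulty and guessing parameters.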

The ancillary estimation procedure was further evaluated
using simulation techniques. In particular, testing was
conducted using a Bayesian algorithm developed by Owen
(1969). Samples of 100 cases each were generated for
computerized adaptive testing using 100 items with known
item parameters. In the generation process, values of $\theta$, the
ability parameter, are sampled randomly from $N(0,1)$ and
are also known. As a result, estimates of the ability
obtained under computerized adaptive testing could be
correlated with known ability. Comparisons of correlations,
$r_{\hat{\theta}\theta}$, were made across three conditions of computerized
adaptive testing where (1) the known item parameters, (2)
the ancillary estimates of the item parameters based on the
2,000 case sample, and (3) the ancillary estimates of item
parameters based on the 3,000 case sample were used in the
algorithm. The appropriateness of the use of the ancillary
estimates could be evaluated, therefore, by comparing the
results obtained for the last two conditions with those



TABLE 1

Root Mean Square Errors for Estimates by Parameterization
Samples and Stages

                                           Root Mean Square Error
Sample Size   Parameterization Stage       $a_i$     $b_i$     $c_i$

2000          Corrected Raw Score:
                Approximation              .309      .181      .077
                Ancillary Estimate         .283      .120      .06[?]
              Bayesian Modal:
                Approximation              .269      .150      .061
                Ancillary Estimate         .242      .123      .056

3000          Corrected Raw Score:
                Approximation              .308      .139      .081
                Ancillary Estimate         .253      .135      .073
              Bayesian Modal:
                Approximation              .252      .109      .059
                Ancillary Estimate         .228      .148      .056



TABLE 2

Correlations Between Estimates and Known
Parameters by Parameterization Samples
and Stages

                                           Correlation
Sample Size   Parameterization Stage       $r_{\hat{a}a}$    $r_{\hat{b}b}$    $r_{\hat{c}c}$

2000          Corrected Raw Score:
                Approximation              .876      .996      .651
                Ancillary Estimate         .873      .996      .668
              Bayesian Modal:
                Approximation              .909      .996      .754
                Ancillary Estimate         .915      .996      .764

3000          Corrected Raw Score:
                Approximation              .884      .996      .611
                Ancillary Estimate         .895      .996      .616
              Bayesian Modal:
                Approximation              .914      .997      .752
                Ancillary Estimate         .918      .997      .760


obtained for the first. In Table 3, the results are 
summarized for each of the conditions of testing. 

TABLE 3

Validity Coefficients ($r_{\hat{\theta}\theta}$) and Average Number of
Items ($\bar{n}$) Required for Tailored Testing to
Various Termination Rules Where the Item
Parameters Were Known or Estimated

                                                          Item Parameters Estimated in
                                                          a Sample of:
     Termination Rules               Parameters Known     2,000 Cases       3,000 Cases
(1)   (2)      (3)       (4)         (5)      (6)         (7)     (8)       (9)     (10)
Rule  $\sigma_\epsilon$  Reliab.  Validity   $r_{\hat{\theta}\theta}$   $\bar{n}$   $r_{\hat{\theta}\theta}$   $\bar{n}$   $r_{\hat{\theta}\theta}$   $\bar{n}$

 1    .5477    .70       .84         .84      2.7         .83     2.0       .84     2.3
 2    .5000    .75       .87         .85      3.2         .86     2.7       .86     2.6
 3    .4472    .80       .89         .89      3.9         .89     3.4       .88     3.2
 4    .3873    .85       .92         .91      4.7         .90     4.0       .90     4.0
 5    .3162    .90       .95         .94      6.6         .92     5.4       .93     5.6
 6    .2828    .92       .96         .96      8.2         .94     6.7       .93     [?].1
 7    .2449    .94       .97         .96     10.8         .95     9.[?]     .94     9.6
 8    .2236    .95       .97         .96     13.3         .95    11.1       .95    11.9

Further explanation, however, is in order before
proceeding to an interpretation of these results. When
compared with conventional testing procedures, computerized
adaptive testing can lead to a substantial reduction in
the number of items required to obtain a given degree of
validity. Therefore, the concern was not only with the
validity obtained but also with the economy in items
observed in obtaining the given validity. Control over the
validity of computerized adaptive testing is direct. When an
individual is being evaluated, the standard error of the
estimate of ability is available at any stage in the sequence.
Validity, over individuals, is controlled by terminating the
individual sequences at a common value for the standard
error of the estimate of ability. In the study, eight such
termination rules were designated. These rules are identified
in columns 1 and 2 of Table 3 and specify that the standard
error of the estimate of ability, $\sigma_\epsilon$, was equal to or less than
(1) .5477, (2) .5000, (3) .4472, (4) .3873, (5) .3162, (6)
.2828, (7) .2449, and (8) .2236, respectively, over all
individuals. Given $\sigma_\epsilon$ for any termination rule, synonymous
rules may be generated through

    $$\rho_{\hat{\theta}\hat{\theta}} = 1 - \sigma_\epsilon^2$$    (7)

and

    $$\rho_{\hat{\theta}\theta} = \sqrt{1 - \sigma_\epsilon^2}$$    (8)

for the expected reliability and validity, respectively. These
synonymous rules are given in columns 3 and 4. The
validities of column 4 may then be compared with obtained
validities. Eight estimates of ability satisfying these rules
were obtained for all cases. Obtained validities were
indexed by the correlations between known ability and
estimated ability, $r_{\hat{\theta}\theta}$, for specified termination rules as
appropriate to the testing condition. As the termination
rule becomes more stringent, the obtained validities given in
columns 5, 7, and 9 increase and compare very closely with
the expected validities given in column 4. Additionally, the
average numbers of items required, the $\bar{n}$, given in columns
6, 8, and 10 also increase as the termination rule becomes
more stringent. Notice that the $\bar{n}$ at each termination rule
differ only slightly across testing conditions. Since the
results were almost identical across testing conditions, the
item-analytic procedure appeared very appropriate in
computerized adaptive testing applications. Consequently,
ancillary estimates of the item parameters based on more
than 2,000 cases and 100 items were strongly recommended
for use in computerized adaptive testing.
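
The expected reliabilities and validities that accompany each termination rule can be reproduced directly from the eight standard errors listed in the text. The Python sketch below assumes that, with ability scaled to unit variance, the expected reliability is 1 minus the squared standard error and the expected validity is its square root. These forms are an assumption on my part, chosen because they reproduce the column 3 and column 4 values quoted for Table 3. The function names are hypothetical.

```python
from math import sqrt

# Standard errors of the eight termination rules listed in the text.
RULES = [0.5477, 0.5000, 0.4472, 0.3873, 0.3162, 0.2828, 0.2449, 0.2236]


def expected_reliability(standard_error):
    """Expected reliability for a unit-variance ability scale (assumed form)."""
    return 1.0 - standard_error ** 2


def expected_validity(standard_error):
    """Expected correlation between estimated and true ability (assumed form)."""
    return sqrt(1.0 - standard_error ** 2)


for rule, se in enumerate(RULES, start=1):
    print(rule, se,
          round(expected_reliability(se), 2),
          round(expected_validity(se), 2))
```

For the first rule, a standard error of .5477 gives an expected reliability of about .70 and an expected validity of about .84.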

Further research in evaluating the item-analytic
procedure has been accomplished for varying numbers of cases
and items (Gugel et al., 1975), and more detailed
recommendations regarding the use of the procedure will be
given later in the conference.

As it turned out, the last significant question, "Is there
an efficient and accurate adaptive algorithm for
computerized testing?" could have been answered in the
affirmative as early as 1969. The important event was the
publication of an Educational Testing Service research
bulletin, "A Bayesian Approach to Tailored Testing", by
Roger J. Owen. Subsequent research (Urry, 1971, 1974b, in
press; Jensema, 1972, 1974, 1975) has shown the
efficiency and accuracy of the algorithm. For example, it is
possible to construct some 2,000 computerized adaptive
tests in some 17 minutes of central processor unit time, and
the precision of measurement can be accurately controlled
with termination rules.
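
The way a termination rule controls precision can be illustrated with a small adaptive testing loop. The Python sketch below is not Owen's algorithm. It substitutes a brute-force posterior on a grid of ability values for Owen's closed-form Bayesian updating, and it selects the unused item whose difficulty is nearest the current ability estimate, which is a simplification. The names `adaptive_test`, `posterior_moments`, and `GRID`, the `(a, b, c)` item layout, and the simulated examinee are all assumptions made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist

N01 = NormalDist()                       # standard normal distribution
GRID = [i / 20 for i in range(-80, 81)]  # ability values from -4 to 4


def icc(theta, a, b, c):
    """Three-parameter normal ogive item characteristic curve."""
    return c + (1.0 - c) * N01.cdf(a * (theta - b))


def posterior_moments(weights):
    """Mean and standard deviation of a discrete posterior over GRID."""
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, GRID)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, GRID)) / total
    return mean, sqrt(var)


def adaptive_test(bank, answer, se_target):
    """Administer items until the posterior SD is at or below se_target."""
    weights = [N01.pdf(t) for t in GRID]  # N(0, 1) prior on ability
    unused = list(bank)                   # copy, so the bank is not altered
    mean, sd = posterior_moments(weights)
    used = 0
    while unused and sd > se_target:
        item = min(unused, key=lambda it: abs(it[1] - mean))
        unused.remove(item)
        a, b, c = item
        if answer(item):
            weights = [w * icc(t, a, b, c) for w, t in zip(weights, GRID)]
        else:
            weights = [w * (1.0 - icc(t, a, b, c))
                       for w, t in zip(weights, GRID)]
        mean, sd = posterior_moments(weights)
        used += 1
    return mean, sd, used


# Hypothetical 100-item bank and a simulated examinee with ability 1.0.
rng = random.Random(0)
bank = [(0.8 + 1.4 * i / 9, -1.9 + 3.8 * j / 9, 0.2)
        for i in range(10) for j in range(10)]
estimate, sd, used = adaptive_test(
    bank, lambda it: rng.random() < icc(1.0, *it), se_target=0.3162)
print(round(estimate, 2), round(sd, 3), used)
```

Tightening `se_target` makes the loop administer more items before stopping, which is the trade-off between precision and test length that the termination rules express.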

In summary, we now find that (1) the specifications for
effective item banks have been developed, (2) these
specifications can be met for a number of significant
abilities, (3) efficient procedures exist for the reliable
estimation of parameters, and (4) an efficient computerized
adaptive testing algorithm is available to conduct the actual
testing. All the necessary prerequisites for the success of
computerized adaptive testing are therefore now in
evidence. At this juncture, the feasibility of computerized
adaptive testing can be realistically assessed, and this
realistic assessment is decidedly and resoundingly
affirmative in nature. At present, computerized adaptive testing
appears to have a future without parallel in the literature of
psychological measurement.

EFFECTIVENESS OF THE ANCILLARY ESTIMATION
PROCEDURE¹

JOHN F. GUGEL, FRANK L. SCHMIDT, AND VERN W. URRY

U.S. Civil Service Commission

Urry (1974a) has presented a graphic method to
provide approximations for the item parameters of the
normal ogive and Birnbaum logistic three-parameter latent
trait models. This method has since been further developed
(Urry, 1975) to provide a more accurate computational
procedure for estimating the three parameters $a_i$ (item
discriminatory power), $b_i$ (item difficulty), and $c_i$ (item
coefficient of guessing). Programmed for the computer, this
technique produces parameter estimates quickly and
inexpensively.

Initial studies of this procedure employed large sample
sizes (N=2000 and 3000 cases) and a relatively large
number of items (n=100). Under these conditions, the
procedure produces very accurate parameter estimates
(Urry, 1975). We are now in a position to examine the
effects of reduced numbers of cases and items on error in
the parameter estimates and on the accuracy of tailored
testing using those estimates. It is known a priori, of course,
that reduction in either the number of cases or the number
of items will, other things being constant, tend to increase
estimation errors. But it is not known at present how large
or practically significant such increases would be. The
present study, exploratory in nature, is addressed to these
questions.

METHOD 

Based on suggestions by Lord (1968, p. 1016) and the
results of the previous study by Urry (1975), it was decided
to allow the number of items to vary from 50 to 100 and
the number of cases to range from 500 to 2000. The initial
100-item bank, from which the smaller banks were later
selected, was characterized by $a_i$ values ranging uniformly
from .80 to 2.20, $b_i$ values distributed uniformly from -1.9
to +1.9, and $c_i$ values from .02 to .24, also uniform in
distribution. These parameter values are not different from
what one might reasonably expect to find empirically given
prescreening of items (Urry, 1974a; Jensema, 1972). In the
reduced item samples, the $a_i$ values were chosen in equal
steps from .80 to 2.20. For example, there were five levels
of $a_i$ for the 50-item test and ten for the 100-item test. Ten
values of $b_i$ in equal steps between -1.9 and 1.9, inclusive,
were arranged within each level of $a_i$. (An exception was the
55-item test, which had eleven values of $b_i$ in equal steps
between -1.9 and 1.9, inclusive, within each of its $a_i$
values.) For different levels of $a_i$, items were matched on $b_i$
values. The $c_i$ values ranged from .02 to .24 in equal steps,
irrespective of $a_i$ and $b_i$. Values of $\theta$, representing simulated
subjects, were sampled randomly from $N(0,1)$. Then for
each $\theta$ the simulation procedure described by Urry (1975)
was used to generate a vector of responses (1 = correct; 0 =
incorrect) for the item bank in question using the known
item parameters. Parameter estimation was then carried out
using this simulated data.

¹Computer processing for this study was done at the University of
Maryland Computer Science Center in conjunction with graduate
work by John Gugel. Arrangements for computer time were made
by Professor Charles Johnson of the Department of Measurement
and Statistics, College of Education, University of Maryland.
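
The bank layout and the generation of response vectors described in this section can be sketched in Python as follows. This is an illustration under stated assumptions and not the simulation procedure of Urry (1975), whose details are not given here. The sketch assumes the three-parameter logistic model with the customary scaling constant 1.7, and it assigns the equally spaced $c_i$ values to items in shuffled order so that they do not follow $a_i$ or $b_i$. The function names and the seed values are hypothetical.

```python
import random
from math import exp


def build_bank(n_a_levels=10, n_b_values=10, rng=None):
    """Bank of (a, b, c) tuples: a in [.80, 2.20], b in [-1.9, 1.9], c in [.02, .24]."""
    rng = rng or random.Random(0)
    n_items = n_a_levels * n_b_values
    c_values = [0.02 + 0.22 * k / (n_items - 1) for k in range(n_items)]
    rng.shuffle(c_values)  # assumption: c is assigned without regard to a or b
    bank = []
    for i in range(n_a_levels):
        a = 0.80 + 1.40 * i / (n_a_levels - 1)
        for j in range(n_b_values):
            b = -1.9 + 3.8 * j / (n_b_values - 1)
            bank.append((a, b, c_values[len(bank)]))
    return bank


def p_correct(theta, a, b, c, d=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + exp(-d * a * (theta - b)))


def simulate_responses(bank, n_cases, rng):
    """One 0/1 response vector per simulated subject, theta drawn from N(0, 1)."""
    data = []
    for _ in range(n_cases):
        theta = rng.gauss(0.0, 1.0)
        data.append([1 if rng.random() < p_correct(theta, *item) else 0
                     for item in bank])
    return data


rng = random.Random(1)
bank = build_bank(rng=rng)           # the 100-item bank
data = simulate_responses(bank, 500, rng)
print(len(bank), len(data), sum(data[0]))
```

Calling `build_bank(n_a_levels=5, n_b_values=11)` gives the 55-item layout mentioned in the text, and `build_bank(n_a_levels=5)` gives the 50-item layout.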

Two indices were used to evaluate the parameter
estimates relative to the known parameters. First, the root
mean square error (RMSE) was computed for the estimated
parameters. The formula for this statistic is:

    $$\mathrm{RMSE} = \sqrt{\frac{\sum (\hat{p} - p)^2}{n}}$$

where $\hat{p}$ = estimated and $p$ = known values of $a_i$, $b_i$, $c_i$, or
$\rho_{i\theta}$, and

n = number of items involved in the particular
analyses.

Second, Pearson correlations between the known and
estimated parameters were computed, i.e., $r_{\hat{p}p}$.

To illustrate the effects of error in the parameter 
estimates on the accuracy of tailored testing, Owen's 
(1968) algonthm was employed. Specifically, tailored 
testmg was earned out on 100 simulated sitbjects using first 
the known item parameters and theii item parameter 
esumates obtamed on 1000 cases and 60 items. To increase 
the number of items used in tailored toting to a more 
realistic level, another identical set of 60 items was 
parameterized on a separate, independent group of 1000 
simulated subjects, and these "items'* were^ combined with 
the original 60 to produce a , bank with' 120 items. In the 
case of the known;^.parameters, both 60-iten>sets were 
entered into tnc tailored testing bank* The known 
parameters in this bank were used to generate the response 
vectors of the MX) simulated subjects, and these vectors in 
tum, weJe used in thc'tailored testing. Correlati(His between 
estimated and actual B were computed at each of eight 
termination rules for each condition of testing. This 
allowed a comparison of correlations across the conditions 
of testing, i.e., where (1) known or (2) estimated item 
parameters were used in the tailoring process. * 
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RESULTS ANDDISCUSSiON 
Results produced by the parameterization procedure for
varying combinations of sample size and number of items
are shown in Tables 1 and 2. In both tables, "Raw Score
Estimates" refer to the parameter estimates prior to
application of the ancillary correction procedure, and the
columns headed "Final Estimates" refer to estimates after
application of the corrections. Table 1 includes the S.E. for
ρ_iθ, the correlation between the continuum underlying the
item and θ, as well as for a_i, b_i, and c_i. "Lost items" are
those for which the estimation procedure did not converge
because of insufficient cases in the tails of the distribution.

Looking at the S.E.'s for the final estimates in Table 1, it
can be seen that, in general, decreasing both sample size and
number of items results in increased RMSEs. This effect
appears to be more pronounced for a_i than for the other
parameters. Moving from 50 to 60 items (sample size
constant) appears to produce marked reductions in error
for a_i, but beyond this, improvements in accuracy with
increases in number of items are smaller. The b_i and c_i were
estimated rather accurately throughout the range of both
independent variables, although variation in sample size
and number of items did have the expected effect. The last
column in Table 1 reveals a tendency for items to begin to
fail to converge during parameter estimation when sample
size is dropped as low as 500. Sample size appears more
crucial in this respect than number of items. Correlations
between final parameter estimates and actual parameters,
shown in Table 2, also pattern themselves as expected,
within the limits of sampling error. In examining these
correlations, one should bear in mind that in the case of
a_i, and to a lesser extent c_i, restriction in range is operating
to lower the tabled values. The items parameterized
contained no values of a_i lower than .80. This value of a_i
corresponds to a biserial correlation of .62 between the
item and latent ability. Past studies (Jensema, 1972; Urry,
1974b) have shown that only about one third of the
items in conventional tests have a_i values this large. No c_i
greater than .24 were included; in practice c_i does exceed
.24, although the range restriction here is probably less
severe than in the case of a_i.
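
The correspondence between a_i = .80 and a biserial correlation of .62 can be checked with a short sketch. It assumes the usual normal-ogive relationship a = ρ / sqrt(1 - ρ²) between discrimination and the item-ability biserial; that formula is not printed in this paper, and the function names are hypothetical.

```python
import math


def a_to_biserial(a):
    """Item-ability biserial implied by discrimination a (normal-ogive assumption)."""
    return a / math.sqrt(1.0 + a * a)


def biserial_to_a(rho):
    """Discrimination implied by an item-ability biserial rho (inverse of the above)."""
    return rho / math.sqrt(1.0 - rho * rho)


print(round(a_to_biserial(0.80), 2))
```

The two functions are inverses of one another, so either quantity can be used to describe an item's discriminating power.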

Results of simulated tailored testing using known
parameters and parameters estimated on a sample of 1000
with 60 items are shown in Table 3. The eight termination
rules, expressed as the standard error of estimate (σ_ε), are
seen in column 2. Column 3 translates these values to
reliability coefficients for θ̂ based on the relationship

\rho_{\hat{\theta}\theta}^{2} = 1 - \sigma_{\epsilon}^{2}     (2)



TABLE 1

Root Mean Square Errors (RMSE)
Before and After All Corrections

                 Raw Score Estimates (RMSE)              Final Estimates (RMSE)                  Lost
Items  Cases     a_i       b_i       c_i       ρ_iθ      a_i       b_i       c_i       ρ_iθ      Items
50     2000      .283      .124      [illeg.]  .043      .395      .137      [illeg.]  [illeg.]  0
50     1000      .292      .193      [illeg.]  .053      [illeg.]  .209      .078      [illeg.]  1
50     500       .370      .164      [illeg.]  .067      [illeg.]  .259      [illeg.]  [illeg.]  0
55     2000      .385      .195      .091      .061      .308      .150      [illeg.]  [illeg.]  0
55     1000      .352      .194      .101      .050      .315      .124      .071      [illeg.]  0
55     500       [illeg.]  .185      .098      .054      .403      .227      .086      .065      4
60     2000      .321      [illeg.]  .091      .056      .253      .140      [illeg.]  .040      0
60     1000      .343      [illeg.]  .089      .059      .322      .144      .062      .044      0
60     500       .360      .194      .080      .070      .342      .179      [illeg.]  [illeg.]  0
70     2000      [illeg.]  .131      .095      [illeg.]  [illeg.]  [illeg.]  [illeg.]  .040      1
70     1000      .324      .189      .095      .054      [illeg.]  [illeg.]  [illeg.]  .045      1
70     500       .386      .197      .096      .072      .351      .187      .083      .058      4
80     2000      [illeg.]  .141      .092      [illeg.]  .214      .150      [illeg.]  .039      1
80     1000      .359      .178      .092      .048      .261      .166      .073      .047      1
80     500       .319      [illeg.]  .091      .063      .311      .229      .079      .048      6
90     2000      .397      .180      .094      [illeg.]  [illeg.]  .149      [illeg.]  [illeg.]  0
90     1000      .341      .171      .089      [illeg.]  .304      .140      [illeg.]  [illeg.]  0
90     500       [illeg.]  .184      .094      [illeg.]  .283      .144      [illeg.]  [illeg.]  2
100    2000      .290      .138      [illeg.]  .049      .223      .131      [illeg.]  .036      6
100    1000      [illeg.]  .137      .088      [illeg.]  .240      .162      [illeg.]  .039      0
100    500       [illeg.]  .189      .100      [illeg.]  .276      .161      [illeg.]  [illeg.]  5

TABLE 2

Correlations Between Known and Estimated Parameters
Before and After All Corrections

                 Raw Score Estimates           Final Estimates
Items  Cases     r_a       r_b       r_c       r_a       r_b       r_c
50     2000      [illeg.]  [illeg.]  [illeg.]  [illeg.]  [illeg.]  [illeg.]
50     1000      [illeg.]  .992      .429      [illeg.]  .990      .492
50     500       .745      .993      .428      .780      [illeg.]  .454
55     2000      .731      [illeg.]  .458      [illeg.]  [illeg.]  [illeg.]
55     1000      .758      [illeg.]  .428      [illeg.]  .995      .546
55     500       .650      .991      .387      .824      .990      .376
60     2000      [illeg.]  .996      .491      .899      .997      .630
60     1000      .771      .994      .546      [illeg.]  .995      .581
60     500       .768      .994      .626      [illeg.]  [illeg.]  [illeg.]
70     2000      [illeg.]  .997      .471      [illeg.]  .991      [illeg.]
70     1000      [illeg.]  .996      [illeg.]  [illeg.]  .996      .521
70     500       .715      .993      [illeg.]  .813      .995      .449
80     2000      .873      .996      .535      [illeg.]  .991      .574
80     1000      .850      .994      .465      [illeg.]  .993      .559
80     500       .839      .991      [illeg.]  .823      [illeg.]  .502
90     2000      .861      .996      [illeg.]  .871      .996      .568
90     1000      .757      .995      [illeg.]  .847      .995      .547
90     500       .804      .995      [illeg.]  .874      .993      .418
100    2000      .837      .991      [illeg.]  [illeg.]  [illeg.]  .690
100    1000      .843      .996      [illeg.]  .863      .996      .627
100    500       .741      .993      .344      [illeg.]  .994      .420

The square root of this value is ρ_θ̂θ, the correlation
between the latent ability estimates (θ̂) and actual latent
ability (θ). Validity coefficients of this sort are given in
columns 4, 5, and 7. Those in column 4 are theoretical
validities based solely on the termination rule chosen.
Those in column 5 were obtained by correlating the θ̂
produced using the known item parameters with known θ.
As expected, they are essentially identical to the predicted
theoretical validities. Those in column 7 were obtained by
correlating the θ̂ produced using the parameter estimates
with the known θ. As expected, they are somewhat lower
than those in columns 4 and 5, but it can be noted that, as
the termination rule becomes more stringent, the
discrepancy decreases. At the most stringent termination
rule, the validity of the θ̂ derived using the parameter
estimates is only .023 lower than that based on the known
parameters. The reliabilities of the two θ̂'s at this termination
rule are .95 and .91, respectively.

TABLE 3

Validity Coefficients (r_θ̂θ), and Average Number of Items (n̄) Required for
Tailored Testing to Various Termination Rules Where the Item
Parameters Were Known or Estimated

       Termination Rules              Parameters Known      Parameters Estimated
(1)    (2)       (3)       (4)        (5)       (6)         (7)       (8)
Rule   σ_ε       ρ²_θ̂θ     ρ_θ̂θ       r_θ̂θ      n̄           r_θ̂θ      n̄
1      .5477     .70       .84        .864      2.43        .792      2.76
2      .5000     .75       .87        .904      3.31        [illeg.]  [illeg.]
3      .4472     .80       .89        .932      4.00        .821      2.89
4      .3873     .85       .92        .935      4.91        [illeg.]  3.70
5      .3162     .90       .95        .955      7.03        .895      5.30
6      .2828     .92       .96        .962      8.77        .921      6.57
7      .2449     .94       .97        .969      [illeg.]    .942      8.91
8      .2236     .95       .97        .975      14.51       .952      11.12
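
The translation among columns 2, 3, and 4 of Table 3 follows directly from equation (2) and can be sketched as below. The function names are hypothetical, and the calculation presumes, as equation (2) does, that θ is scaled to unit variance.

```python
import math


def reliability_from_se(sigma):
    """Reliability implied by a standard error of estimate, per equation (2)."""
    return 1.0 - sigma ** 2


def validity_from_se(sigma):
    """Theoretical validity: the square root of the implied reliability."""
    return math.sqrt(reliability_from_se(sigma))


def se_for_reliability(reliability):
    """Standard error of estimate at which testing stops for a target reliability."""
    return math.sqrt(1.0 - reliability)


for target in (0.70, 0.80, 0.90, 0.95):
    sigma = se_for_reliability(target)
    print(f"{target:.2f}  {sigma:.4f}  {validity_from_se(sigma):.3f}")
```

For example, a target reliability of .95 corresponds to a stopping value of σ_ε = .2236 and a theoretical validity of about .975, matching the last row of the table.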

Why are the termination rules not fully attained when
the parameter estimates are used? The tailoring algorithm
capitalizes on errors in the parameter estimates. As a
consequence, tailored testing using the estimated parameters
terminates prior to actually reaching the pre-set
termination rule. That is, because of capitalization on error
in parameter estimates during the process of item selection,
the reliability levels implied by the Owen algorithm at
any stage during the tailoring process are somewhat
inflated. This leads to a too early termination of tailored
testing, and, when the obtained θ̂ are correlated with θ, it
becomes evident that the pre-set reliability level for
termination has not been met. In the present example, an
average of 14.51 items was administered when the known
parameters were used but only 11.12 when the parameter
estimates were used. This shrinkage problem can be
overcome by setting the reliability termination rule higher
than that actually required. In our present example, the
termination rule should be set at .95 in order to obtain θ̂ of
reliability .91.
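
The capitalization on error described above can be illustrated with a deliberately simplified sketch. It does not implement the Owen algorithm: as a stand-in, items are simply ranked by their estimated discrimination, and all numerical settings (bank size, noise level, number of trials) are assumptions chosen for illustration.

```python
import random


def selection_bias(n_items=120, n_pick=12, noise_sd=0.3, trials=200, seed=1):
    """Mean estimated vs. mean true discrimination of the items a selection rule picks."""
    rng = random.Random(seed)
    est_total = 0.0
    true_total = 0.0
    for _ in range(trials):
        true_a = [rng.uniform(0.8, 2.0) for _ in range(n_items)]
        # Estimated parameters = true parameters plus random estimation error.
        est_a = [a + rng.gauss(0.0, noise_sd) for a in true_a]
        # Select the items that look best according to the estimates.
        picked = sorted(range(n_items), key=lambda i: est_a[i], reverse=True)[:n_pick]
        est_total += sum(est_a[i] for i in picked) / n_pick
        true_total += sum(true_a[i] for i in picked) / n_pick
    return est_total / trials, true_total / trials


mean_est, mean_true = selection_bias()
print(f"selected items: estimated a = {mean_est:.2f}, true a = {mean_true:.2f}")
```

Items whose errors happen to be positive are the ones most likely to be chosen, so the selected items appear more informative than they really are. This is the same mechanism that makes the computed reliability optimistic and the test stop early.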
ITEM PARAMETERIZATION PROCEDURES FOR THE FUTURE



Failure to appreciate the important psychometric role
played by guessing in conventional multiple choice tests
prevented, until recently, practical application of latent trait
theory to tailored testing. When this problem was properly
addressed, it was found that the solution could be
expanded to produce an inexpensive and highly accurate
item parameterization procedure. Combined with Owen's
(1969) elegant Bayesian algorithm and available CRT
hardware, these developments made computer-assisted
tailored testing feasible from a practical point of view.

The capacity to parameterize new items for possible
later inclusion in the item bank during routine operation of
the computer-assisted testing system would be a significant
step in the direction of even greater practicality (Killcross,
1974). Such a procedure would eliminate the necessity for
periodic application of the full parameterization process
described by Urry (1975a, 1975b). The Urry ancillary
estimation procedure can be modified to provide the
capability to parameterize items in the environment of
a live, large-scale, computer-interactive tailored testing
system or network. It can thus provide a convenient
technology for updating and expanding item banks in
ongoing tailored testing systems.

The parameterization procedure is as follows: In addition
to the items that are part of his tailored test, each
examinee receives a group of additional experimental items.
On-line ancillary parameterization can begin for any of
these items as soon as a sufficient number of examinees
have responded to it. For each item, ρ_iθ is computed
against the uniformly reliable Bayesian θ̂ from the Owen
algorithm. (Notice that the item does not enter in any way
into the determination of θ̂.) P_i is estimated in the usual
way using sample data. The θ̂ are next grouped into k
intervals. Provisional values for c_i are assumed, and the
minimum χ² procedure is applied to obtain approximations
of a_i, b_i, and c_i. These procedures have been outlined in
Urry (1975b) and are described in full in Urry (1975a).
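
The grouping step in this procedure can be sketched as follows. This is an illustration only: equal-width intervals over the range -3 to +3, the function name, and the five-examinee data set are assumptions, and the subsequent fitting of a_i, b_i, and c_i to the grouped proportions is not shown.

```python
def proportions_by_interval(theta_hats, responses, k, lo=-3.0, hi=3.0):
    """Group ability estimates into k equal-width intervals for one experimental item.

    Returns one (midpoint, number of examinees, proportion correct) tuple per interval;
    the proportion is None for an interval containing no examinees.
    """
    width = (hi - lo) / k
    counts = [0] * k
    correct = [0] * k
    for theta_hat, u in zip(theta_hats, responses):
        index = int((theta_hat - lo) / width)
        index = min(max(index, 0), k - 1)  # place out-of-range estimates in the end intervals
        counts[index] += 1
        correct[index] += u
    return [
        (lo + (j + 0.5) * width, counts[j], correct[j] / counts[j] if counts[j] else None)
        for j in range(k)
    ]


# Illustrative data: five examinees' ability estimates and their 0/1 responses to one item.
table = proportions_by_interval([-2.5, -2.0, 0.1, 0.2, 2.5], [0, 1, 1, 1, 1], k=3)
print(table)
```

The resulting proportions, one per interval, are the observed points to which an item characteristic curve would then be fitted.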

The purpose of this study was to evaluate the on-line
ancillary parameterization process using model sampling
and simulation techniques. The one hundred items to be
parameterized were those used in the earlier Gugel study,
and are shown in Table 1. (In practice, a much smaller
number of items would typically be parameterized, but for
evaluation purposes a larger number is desirable.)
Dependent variables in this study were also the same as
those in Gugel's study: correlations between known and
estimated parameters and the square root of mean squared
deviations of estimated from known parameters. Independent
variables are illustrated in Figure 1. Two different
banks were used in tailored testing to produce the Owen θ̂,
designated as the Verbal Ability Bank and the Ideal Bank.
The Verbal Ability Bank of this study consists of the 103
most frequently used items (based on counts from previous
simulation studies) from the Commission's 200-item Verbal
Ability Bank. The Commission's bank, in turn, is made of
the best 200 items out of 700 verbal ability items calibrated
by Urry (1974). Calibration was carried out on large
samples and the final 150 items were chosen to provide a
wide distribution of b_i values, high a_i values, and low
(below .30) c_i values. The 103-item bank used here thus
represents a currently attained, though improvable, level of
quality. The Ideal Bank is the same 100 items being
parameterized (See Table 1). Three different termination
rules were examined for the Ideal Bank; for the Verbal
Ability Bank, the most stringent rule (.95) was omitted as
impractical. Sample sizes of 1000, 1500, and 2000 were
examined. Simulated subjects (θ's) were sampled and their
response vectors generated as in the Gugel study. (This
procedure is described in full in Urry [1974a].)

RESULTS AND DISCUSSION 

The obtained standard errors for the Ideal and Verbal
Ability banks are shown in Tables 2 and 4, respectively.
Tables 3 and 5 present the correlations between actual and
estimated item parameters. In most cases, changes
associated with variation in the independent variables were
in the hypothesized direction. Increasing the number of
subjects and the reliabilities required for termination of
tailored testing usually resulted in lower standard errors and
higher correlations between known and estimated parameters.
Some deviation from this pattern occurred because
of sampling error. (For each bank, a different sample of
simulated subjects was used for each termination rule and
sample size examined.) The same is true of the ancillary
corrections: the effect was generally to decrease standard
errors and increase correlations, but because of sampling
error this was not always the case.

In examining the correlations between known and
estimated parameters, one should bear in mind that in the
case of a_i, and to a lesser extent c_i, restriction in range is
operating to lower the tabled values. The items
parameterized (See Table 2) contained no values of a_i lower
than .80. This value of a_i corresponds to a biserial
correlation of .62 between the item and latent ability. Past
studies (Jensema, 1972; Urry, 1974) have shown that only
about one-third of the items in conventional tests have a_i
values this large. No c_i greater than .27 were included; in
practice c_i does exceed .27, although the range restriction
here is probably not as great as in the case of a_i.

[Figure 1. Experimental Design. Independent Variables: item banks and termination rules (reliability values for termination rules).]

The rather high a_i values among the items parameterized
must be considered also in evaluating the root mean square
errors for a_i. Errors in a_i are much larger for high a_i than
low a_i, since when a_i is high, small errors in ρ_iθ lead to large
errors in a_i. For example, if ρ_iθ = .90, a_i = 2.01. If ρ_iθ =
.88, a_i = 1.85, a difference of .16. But if ρ_iθ = .50, a_i = .58.
Then if ρ_iθ = .48, a_i = .55, a difference of only .03.
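
The pattern in this example can be stated generally. The following is a sketch that assumes the usual normal-ogive relationship between discrimination and the item-ability biserial, which is not written out in the paper:

```latex
a_i = \frac{\rho_{i\theta}}{\sqrt{1 - \rho_{i\theta}^{2}}},
\qquad
\frac{d a_i}{d \rho_{i\theta}}
  = \frac{1}{\sqrt{1 - \rho_{i\theta}^{2}}}
  + \frac{\rho_{i\theta}^{2}}{\left(1 - \rho_{i\theta}^{2}\right)^{3/2}}
  = \left(1 - \rho_{i\theta}^{2}\right)^{-3/2}
```

Under that assumption the derivative grows without bound as ρ_iθ approaches 1, so a fixed error in ρ_iθ produces an increasingly large error in a_i for highly discriminating items.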

The real test of the usefulness of the on-line parameterization
process lies in the performance of the parameter
estimates in tailored testing. The better the estimates, the
closer they will come to equaling the performance of the
known parameters. The parameter estimates obtained in
this study have not yet been used in simulated tailored
testing, but an idea of how well they would perform can be
obtained by examining the performance of parameter
estimates from Gugel et al. (1975) with roughly equivalent
errors. Table 6 compares root mean square errors and
correlations between known and estimated parameters from
the present study for the Verbal Ability Bank with 2000
cases and reliability cut-off of .93 with the results obtained
by Gugel et al. (1975) using 1000 cases and 60 items with
the full parameterization process. Except for the standard
error of b (which is lower) and one of the correlations
(which is also lower), his results are essentially equivalent.
Using a reliability cut-off of .95, Gugel et al. conducted
simulated tailored testing using both the known and the
estimated parameters. Known parameters produced
r_θ̂θ = .9752, exactly corresponding to the termination rule
(i.e., [.9752]² = .95). With the parameter estimates, r_θ̂θ was
.9516, corresponding to an obtained reliability of .9054.

Because the tailoring algorithm capitalizes on chance
errors in the parameter estimates, tailored testing using the
estimated parameters is terminated prior to actually
reaching the pre-set termination rule. That is, because of
capitalization on error in parameter estimates during the
process of item selection, the reliability levels computed by
the Owen algorithm at any stage during the tailoring
process are somewhat inflated. This leads to a too early
termination of tailored testing, and, when the obtained
θ̂ are correlated with θ, it becomes evident that the pre-set
reliability level for termination has not been met. In the
present example, an average of 14.51 items was
administered when the known parameters were used but
only 11.12 when the parameter estimates were used. This
shrinkage problem can be overcome by setting the
reliability termination rule higher than that actually
required. In our present example, the termination rule
should be set at .95 in order to obtain θ̂ of reliability .90.
Simulation studies provide a convenient (and perhaps the
only) method of determining in advance of actual use the
amount of shrinkage to be expected when items are
parameterized on given sample sizes and with given
numbers of items. The shrinkage problem here is thus
somewhat different from that characterizing, say, multiple
regression, in that its effects can be cancelled out by
appropriate selection of termination rules. Two points,
however, should be noted here:

1. Parameterizing on large sample sizes (both numbers
of items and numbers of cases), and thus obtaining
more accurate initial parameter estimates, is preferable
where feasible to adjusting termination rules to
allow for shrinkage.

2. For certain tailored testing usages (for example,
battery tailoring or multivariate tailored testing) the
advantages of parameter estimates that can fully meet
pre-set termination rules become substantial. That is,
adjustment of termination rules to allow for shrinkage
becomes, at best, inconvenient and awkward.
In light of these facts, an important question is whether
or not the on-line parameterization process can produce
estimates with errors low enough to reduce shrinkage to
negligible levels. An important consideration, of course, is
the quality of the item bank on which the original θ̂ are
derived. By parameterizing and adding to the Verbal Ability
Bank those items which were erroneously rejected earlier
on the basis of low point-biserial and biserial item-total
indices, it will probably be possible to make the Verbal
Ability Bank equivalent to the Ideal Bank used in this
study. By increasing the number of cases to 3000, or
perhaps beyond 3000, it should be possible to reduce the
root mean square errors shown in Table 2 (2000 cases,
cut-off at .95) to levels comparable to those obtained by Urry
(1975) with the full parameterization process (2000 cases,
100 items). Urry's root mean square errors were .242, .123,
and .056 for a_i, b_i, and c_i, respectively. At this level of
accuracy, little shrinkage was in evidence. It should be
borne in mind that, in the case of the on-line parameterization
process, the number of cases can be increased at
little or no cost. Also, as the quality of the bank
increases, more stringent termination rules can be introduced,
further increasing accuracy of the on-line parameter
estimates.

TABLE 3

Correlations Between Known and Estimated
Parameters - Ideal Bank

TABLE 4

Root Mean Square Errors* For Item Parameter Estimates and ρ_iθ
Using the Verbal Ability Bank

*RMSE = \sqrt{\sum (\hat{p} - p)^{2} / n}, where p = parameter and
n = number of items.

TABLE 5

Correlations Between Known and Estimated Parameters

TABLE 6

Comparison of Gugel Results with Present Study Results

                      Root Mean Square Errors                Correlations (r_pp̂)
                      a_i     b_i     c_i     ρ_iθ           a_i        b_i     c_i
Gugel (1975)*         .322    .140    .062    .044           [illeg.]   .995    [illeg.]
Present Study**       .331    .250    .072    .045           [illeg.]   .996    .591

*N = 1000, 60 items; full parameterization procedure.
**Verbal Ability Bank. N = 2000. Reliability cut-off = .93.

A final modification of the on-line parameterization
process can be made which should further reduce estimation
errors. As the parameterization procedure is presently
set up, those examinees whose θ̂ do not attain the
termination rule reliability within 30 items are dropped
from the sample. Because coverage of θ is weakest in the
Verbal Ability Bank in the low ranges, the dropped
subjects tend to be concentrated in the low end of the
distribution. This creates a paucity of information in a
range in which many c_i values are determined, leading to
higher c_i errors. Also, when the truncated distribution is
restandardized, the result is a displacement of the parameter values.
In the case of the Ideal Bank, no subjects were dropped at
the .91 and .93 termination rules. Even at the .95
termination rule few examinees failed to reach the criterion
(10 at N = 1000, 8 at N = 1500, and 9 at N = 2000). In
the Verbal Ability Bank, no subjects were dropped at .91,
but at .93, 23 were dropped at N = 1000, 53 at N = 1500,
and 40 at N = 2000. Thus, up to 3.5% were eliminated. This
probably explains to a great extent the failure of the .93
termination rule to produce noticeably better estimates
than the .91 rule (Tables 4 and 5). Estimates would
probably be improved by retaining in the sample those
subjects who fail to reach the termination rule within 30
items. Although these θ̂ are less reliable, they probably
provide information at low θ which is useful for parameterization
purposes.
INVITED DISCUSSION



DR. FREDERIC M. LORD 
Educational Testing Service: 

It is appropriate that my discussion should be expressed
in the first person singular to continually remind you that
I am giving my own opinions, which may be biased, since I
am not a disinterested party here. There have been many,
many important points made during these sessions. I have
chosen 14 points to emphasize in my discussion.

1. Cliff (Note 1) writes, "It is felt that our formulation
will provide the framework for a test theory which is more
appropriate to the interactive case than either the classical
or traceline theories are." I am sure he would not want this
challenge to ICC theory to go unanswered. Cliff proposes
that the appropriate model for the item responses is the
Guttman scale.

Since the Guttman scale is a special case of the more
general logistic or normal ogive item characteristic curve, I
cannot see how the Guttman scale can be called a more
appropriate model than the logistic or normal ogive. If the
Guttman scale were the correct model, the fitted logistic or
normal trace lines would come out in the Guttman form.

The Guttman scale assumes that the tetrachoric correlation
between any two items is 1.00. This value may be
approximated for certain attitude test data, but for
aptitude and achievement test data typical tetrachoric item
intercorrelations are usually less than 0.35. This is so very
different from 1.00 that I cannot see how the Guttman
model can be considered acceptable for aptitude and
achievement tests.
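
The sense in which the Guttman scale is a special case can be written out as a limit. This is a sketch assuming the logistic form with no guessing; the notation Ψ, a_i, and b_i is introduced here for illustration:

```latex
P_i(\theta) = \Psi[a_i(\theta - b_i)] = \frac{1}{1 + e^{-a_i(\theta - b_i)}},
\qquad
\lim_{a_i \to \infty} P_i(\theta) =
\begin{cases}
0, & \theta < b_i \\
1, & \theta > b_i
\end{cases}
```

As the discrimination a_i increases without limit, the curve becomes a step at θ = b_i: every examinee below the step fails the item and every examinee above it passes, which is the Guttman pattern.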

2. Consider the problem of testing and assigning new
armed forces recruits. One recruit, perhaps, should take a
complete battery of tests to determine his suitability for
officer training school. The next recruit, however, should
be quickly extricated from this battery of tests and perhaps
given a battery of mechanical aptitude tests. How can we
use adaptive testing to route a new recruit through many
such batteries of tests efficiently, with a minimum waste of
time? Glenn Bryan raised this important question with me
some years ago. It seems as if adaptive testing should be an
excellent way to deal with this problem. Yet the situation is
so multidimensional that current theory does not tell us
how to proceed. Here is a very important unsolved
problem.

3. Waters has pointed out and documented something
that some of us had overlooked: that an adaptive test
should be expected to take longer to administer than a
conventional test with the same number of items. The
reason is that the conventional test contains items that are
too hard or too easy for each examinee, items that he can
answer (or omit) without need for lengthy consideration.
Studies of adaptive testing will have to take testing time
into account.



4. There is one situation in which adaptive testing (or
some other unconventional procedure) is really indispensable.
Suppose it is necessary to have good measurement
over an unusually wide range of ability. As a first step, one
might build a conventional type of test with extra easy
items added at one end and extra hard items at the other,
so as to have some items that are appropriate in difficulty
for each ability level. Of course, the easy items are a waste
of time for the high-level examinees, but that is not the
serious problem. The hard items are not merely a waste of
time for the low-level examinees. The guessing of low-level
examinees on the hard items adds so much noise that the
measurement provided by the easy items is nearly drowned
in random error.

In such situations, it can be shown that the test would
be much improved as a measuring instrument for low-level
examinees if we simply threw away (or refused to score)
the more difficult half or two-thirds of the test. The
situation cannot be remedied simply by adding more easy
items. If we wish to obtain good measurement at low as
well as at high ability levels, some kind of tailoring is
necessary so that hard items are not administered to
low-level examinees.

5. If total testing time is held fixed, adaptive testing
leads to better measurement for some examinees. If
accuracy of measurement is held fixed, adaptive testing
leads to reduced testing time for some examinees. These
two alternatives are not basically different.

Keeping the standard error of measurement fixed across
examinees would be simple if the test were very long or if
we knew the true parameter values, and if all items had
identical characteristic curves. Otherwise there may be
difficulty in finding a good small sample theory and
method. Gugel and Schmidt have given empirical evidence
of this. This is a problem in sequential estimation (Wald,
1951; Robbins, 1959; Bickel & Yahav, 1968). Except
perhaps for Bayesians, methods of sequential estimation are
not as well settled as are methods of sequential hypothesis
testing. Even sequential hypothesis testing poses unsolved
problems when the items do not all have identical
characteristic curves.

6. It is undoubtedly significant that most of the
speakers here are using two- or three-parameter item
characteristic curve models. No one here has urged that
adaptive testing be limited to the one-parameter Rasch
model.

It is sometimes asserted that the Rasch model is the only
one that allows us to estimate examinee ability independently
of the items administered. I would argue that all ICC
models allow us to do this. The unique virtue of the Rasch
model is that it provides a sufficient statistic for estimating
examinee ability. Sufficient statistics are desirable, but they
are not common in statistical work, outside of the usual
normal-curve theory. Statistical inference still proceeds very
effectively in the absence of sufficient statistics.
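
The sufficiency property can be seen by writing out the likelihood of a response pattern. This is a sketch in notation introduced here: u_i is the 0/1 response to item i, b_i its difficulty, and the Rasch form is written without a scaling constant.

```latex
P(u_1, \dots, u_n \mid \theta)
  = \prod_{i=1}^{n} \frac{e^{u_i(\theta - b_i)}}{1 + e^{\theta - b_i}}
  = \frac{e^{\theta \sum_i u_i} \; e^{-\sum_i u_i b_i}}
         {\prod_{i=1}^{n} \left(1 + e^{\theta - b_i}\right)}
```

The responses interact with θ only through the number-right score Σ u_i, which is therefore sufficient for θ. Once items differ in discrimination, or a nonzero lower asymptote is added, the likelihood no longer factors in this way.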

The objection usually cited against the Rasch model is
that it assumes all items to be of equal discriminating
power. I suspect that an even more serious objection is that
it assumes there is no guessing. Any attempt to modify the
Rasch model to take guessing into account would necessarily
destroy the sufficiency properties of the Rasch model
that make it attractive.

7. This brings us face to face with the question whether
to use a two- or a three-parameter ICC model. Waters used a
two-parameter normal-ogive model and the assumption that
ability is normally distributed to estimate the a_i parameters
(discriminating power) of the 50 verbal items in Form 2B
of SCAT II. By chance, I had available estimates of the
same parameters based on the three-parameter logistic
model, computed by a program called LOGIST (available
on request).

I have plotted Waters' values against the LOGIST values
in Figure 1. Each point is shown as a digit representing item
difficulty. The larger the digit, the more difficult the item
and the more the examinees' responses are affected by
guessing. Agreement is good only for the easy items where
there is no guessing.

Many studies comparing different estimation methods
should be carried out. Some should use real data; some
should use artificial data, where the true parameters are
known. I should be glad to run on LOGIST any suitable set
of data that someone here may wish to use for making such
comparisons.

8. In the three-parameter models, the ICC's have the
form c_i + (1 - c_i)Ψ[a_i(θ - b_i)]. This mathematical form is
not beyond challenge, as Samejima has pointed out, but it
is relatively easy to defend as a versatile form that fits the
data, so long as we do not suggest that examinees either
know the answer to the item or else guess with probability
of success c_i. We all know that examinees do not respond
this way. If ICC theory were based on the dichotomy,
knowledge or random guessing, it would not be credible.
For this reason, it may be best not to refer to c_i as a
'guessing parameter.' (I confess to violating this good
advice.)
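
The role of c_i as a lower asymptote, rather than a literal guessing probability, can be read off the limiting values of the curve. The sketch below assumes that Ψ denotes the logistic function and that a_i > 0:

```latex
P_i(\theta) = c_i + (1 - c_i)\,\Psi[a_i(\theta - b_i)],
\qquad \Psi(x) = \frac{1}{1 + e^{-x}}

\lim_{\theta \to -\infty} P_i(\theta) = c_i,
\qquad
P_i(b_i) = \frac{1 + c_i}{2},
\qquad
\lim_{\theta \to \infty} P_i(\theta) = 1
```

On this reading c_i is simply the height at which the curve levels off for examinees of very low ability, whatever process produces their responses.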

9. When working with real answer sheets, it becomes
necessary to deal with the problem of omitted responses. If
we require the examinee to answer all items, we are
purposely introducing random error into our data. In
addition, we are forcing an examinee who has demonstrated
a certain level of performance by his responses to gamble
on some possibly random events, which may, if he is
unlucky, destroy all the positive evidence of ability that he
has displayed.

If we permit the examinee to omit items, we cannot
properly treat such responses as wrong. To do so would
penalize the examinee who omits, in comparison to the
examinee who guesses.

It seems at first thought that we might simply treat
omitted items as if they had not been administered at all.
This cannot be correct, however. If we ignore omitted
items, an examinee could win a very high estimate of ability
simply by answering items only when he was completely
sure of his answer.

The fact that an examinee has omitted an item carries
information about his level that cannot be ignored. A
method for using this information efficiently, under certain
assumptions, is outlined in a Psychometrika paper (Lord,
1974).

10. 1 want to take this opportunity to make a concc- 
tion. In a 1968 paper (Lord, 1970), I wrote. 

If ii^ = 0.333, (index the ^umpfioiu jliC4d> made (ihct 
iclbbaity for a 60Htcm ic$l will be 0.80; if - 0.5, ihh 
reliability ^fciil be 0,90; \Sa^ = 1.0, this irliabOity will be 0.97. 
In vjcw of this, wc shall choosc tf^- - 0.S as a typical value -^nd 
shall address most of our attention to it. 

After seven years of experience with the a parameter, 
these reliabilities sound high. Actually, they are correct, 
but, as the assumptions stated, they are for free-response, 
not multiple-choice items. Urry made this same point this 
morning. Since most of the cited paper dealt with multiple- 
choice items, it was a mistake to suggest a = .50 as a typical 
value. Although the diagrams presented in that paper 
required the reader to supply his own values of a, the 
general impression given was one of only limited 
enthusiasm for adaptive testing. 

Current results show that when a = 0.9, a peaked test 
composed of 40 five-choice items should have a KR20 
reliability of .90. When a is 0.9, the conclusions supplied 
by the diagrams in the cited paper are quite encouraging for 
the future of adaptive testing. 

11. The purpose of the cited paper was to evaluate 
adaptive tests in comparison to conventional tests. To do 
this, the situation considered had to be a simple one. This 
was the reason for the use of a fixed-step-size up-and-down 
branching procedure. Such a procedure is not to be 
recommended for practical testing. 

When the item parameters have been estimated and a 
computer is available for making the calculations, the 
choice of the item to be administered next should be made 
by checking all unused items (perhaps within a specified 
item type) and selecting the item that is expected to give 
the most information about the examinee. 

If a Bayesian prior distribution of ability is being used, 
and if this distribution is normal, this is Owen's (in press) 
procedure, frequently used today. In such a procedure, 
except for certain approximations, each step is locally 
optimal. We cannot expect local optimality to produce 
overall global optimality, but the difference may not be of 
great importance. 
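
The selection rule described in this point can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Owen's procedure: it assumes a three-parameter logistic item response function with the conventional scaling constant D = 1.7, represents each item as a hypothetical (a, b, c) tuple, and picks the unused item with the largest Fisher information at the current ability estimate. All function names are invented for the sketch.

```python
import math

def p_correct(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response (assumed model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c, D=1.7):
    """Fisher information of one item at ability theta under the 3PL model."""
    p = p_correct(theta, a, b, c, D)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def select_next_item(theta_hat, pool, used):
    """Return the index of the unused item that is most informative at theta_hat.

    pool is a list of (a, b, c) tuples; used is a set of indices already given.
    """
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))
```

Because the rule looks at information rather than difficulty alone, a highly discriminating item near the current estimate is preferred over a weakly discriminating one at the same difficulty.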

12. When we select the next item to be administered on 
other considerations besides item difficulty, we no longer 
have an up-and-down branching procedure. The next item 
administered after a correct response might be an easier 
item, not a harder item. 

The recommended procedure means that items with high 
a will be used very frequently and items with low a will be 
used seldom or not at all. The gain from this use of the best 
items will probably more than double the gain from any 
procedure, such as the up-and-down procedure, that selects 
items solely on item difficulty. 

Furthermore, the larger the item pool, the greater the 
gain. This is not surprising. We always knew that if we 
selected the best items from ten tests, we could build a 
single test that would be much more reliable than any of 
the original tests. 

Figure 1. SCAT 2B. A comparison of estimated parameters. The 
two-parameter model assumes a normal distribution of 
ability. Each item in the plot is located by a digit which 
represents item difficulty (b + 3). The easiest items are 
indicated by a 0, the hardest by a 5. 

13. My last point concerns the use of Bayesian inference 
in adaptive testing. When we are testing large numbers of 
examinees all coming from a single source, we are in a really 
exceptionally good position to obtain and use a prior 
distribution describing the examinees. It would seem 
negligent not to obtain and use such a readily available 
prior distribution. 

On the other hand, I would like to make a simple point 
not often expressed. Bayesian inference based on a prior 
distribution will give correct results when the prior 
corresponds, in some sense, to reality. It is likely to give 
incorrect results if the prior itself is incorrect. 

In most Bayesian work, it is usually not practicable to 
determine whether the prior is correct or incorrect. In our 
work, on the contrary, it is fairly easy to do so. We need 
only estimate the ability of each person tested and then 
look at the distribution of estimated abilities. 

If we were testing unselected school children in grade 
school, a normal distribution of ability might possibly be 
found. When we are testing highly selected groups in college 
or elsewhere, it seems unlikely that we will find a normal 
distribution. 

Bayesians point out that the effect of an assumed prior 
becomes unimportant as the number of observations 
becomes large. In our context, this means that our ability 
estimates will not be spoiled by an incorrect prior 
distribution of ability provided the test administered is long 
enough. 

This is not the whole story, however. The assumption of 
a normal distribution of ability, if false, may lead to 
unsatisfactory estimates of item parameters. The usual 
formula for biserial r can give absurd results if the 
continuous variable, in this case examinee ability, unknown 
to the statistician, is far from normally distributed. Unlike 
some other effects of Bayesian priors, this difficulty does 
not diminish as sample size becomes large. 

Two different estimates of the distribution of examinee 
ability for one set of data are shown in Figure 2, 
reproduced here from Lord (1974). The agreement between 
the two estimates, obtained from very different 
assumptions, gives me some confidence in these results. My 
empirical results from other sets of data (including a 
representative sixth-grade group) are similar. When the 
ability scale is chosen so that all item characteristic curves 
are three-parameter normal ogives or logistic curves, it 
turns out, for my data, that ability is not normally 
distributed. 

Figure 2. Distribution of estimated $\hat\theta$ (histogram) and estimated 
distribution of $\theta$ (curve). Reproduced from Lord (1974) 
with permission of Psychometrika. 

14. Although I am not a market analyst, I will without 
much risk venture two assertions. Computer costs, if they 
have not already done so, will come down to the point 
where computer-based adaptive testing is economical. When 
this happens, adaptive testing will come into wide use. The 
Tailored testing has been talked about for many years in 
academic circles. In this conference, we have heard firm 
plans for action. The promise of tailored testing is 
becoming real. Numberless simulated examinees have taken 
tailored tests and a substantial, though smaller, number of 
real people have also had the experience. The use of 
tailored tests will provide substantially improved efficiency 
and will have a number of beneficial side effects, as 
mentioned by McKillip and Weiss, among others. Testing 
conditions will be more nearly standardized, the test will 
hold the taker's interest because each item will be a 
challenge, possibly there will be less test anxiety, feedback 
may decrease racial bias (Weiss; Johnson, 1973).² 

There will also be some harmful side effects, that we 
may as well face. People will have trouble understanding 
the system, and complaints will be frequent. Two people 
with widely different abilities will both experience getting 
about half the items right, yet get very different scores; 
one will be accepted, the other rejected. If these two people 
compare notes, they may be confused. The anti-testing 
forces are also for the most part anti-computer, so negative 
voices will be raised. Security is at least as difficult with a 
computer system as with a paper and pencil system. But 
these are operational problems, and now is not the time to 
worry about them. They will all be solved, somehow. I 
merely list them to counter the tendency to believe that the 
millennium is upon us. 

Now let me make one thing perfectly clear. I am about 
to criticize aspects of the work reported at this conference. 
That is my job. But the one most important fact, that 
outweighs all criticism, is this: The operational use of 
tailored testing is a giant step forward in personnel 
evaluation. Evidence indicates as much as a 2 to 1 gain in 
efficiency, and possibly some very important side benefits. 
I am completely convinced that this is an important step to 
take. My comments are of two kinds: suggestions for 
clarifying and improving the theoretical basis for this big 
step, and impatience at our not yet having planned further 
giant steps. These steps should be justified not in terms of 
saving money, which Hansen claims, but in terms of doing a 
better job. 

Let us now consider some of the technical problems in a 
computer-based system. We have heard two plans for item 
analysis "on-the-fly", as they say in the computer trade. A 
question arises about some of the item analysis procedures 



¹ This work was done with support from Grant GB37520 from 
the National Science Foundation. The author is indebted to Warren 
S. Torgerson for many fruitful discussions of computer applications 
in testing and personnel decision. 

² Throughout, references to other papers in this conference are 
by author only; other references are followed by publication year. 



(Urry, Jensema) which still seem to be built on the biserial 
correlation of the item with the ability scale, and the 
overall proportion of correct answers. These raw data are 
reparameterized (to use a word that should be 
banned from civilized discourse) but the basic data are the 
biserial correlation and the proportion correct. Both of 
these indices depend on the notion of a 
population of test takers. Yet one purpose of tailored 
testing is to avoid the notion of population. What, for 
example, is the population for Lord's broad-range flexilevel 
test of verbal ability? Everyone from fifth grade to college? 
In tailored testing, it would seem that the item parameters 
must be based on the regression of the item on the ability 
scale. This sounds a little circular; perhaps it is. Some sort 
of iterative optimization process would be needed at the 
start, to ever get the ability scale in the first place. Cliff 
described one such procedure for his ordinal scale model; 
an equivalent procedure could easily be devised for the 
metric model. 

Cliff's procedure also depends on a population. He goes 
so far as to say that the purpose of a test is to rank order 
the population of examinees. Sometimes it is, but often it is 
not. Often the purpose is to categorize the examinee as 
qualified or not qualified for a particular job. Or even 
better, to give a quantitative index of the degree of 
qualification. The only population we are really interested 
in is the population of successful job holders. 

There are other technical problems with Cliff's scheme, 
which he promises to solve. For example, he did not 
describe what happens when a person's item responses have 
contradictory implications for other cells in his matrix. 
Indeed his system probably tries to avoid asking questions 
that might provide contradictory information. 

The main reservation I have about the technical side of 
tailored testing is the commitment to latent trait theory. 
The concept of a latent ability scale is a great improvement 
over the concept of a true score. The true score model was 
never a very good idea; rather, it was a simple model that 
worked pretty well. But are we sure that the latent ability 
score is much better? Does the latent trait model fit the 
tests for which it is used? Is the assumption of local 
independence really tenable? Suppose, for example, that 
there are secondary factors in common among subsets of 
items. How much difference would that make? Nobody 
knows. 

The point is that latent trait theory is a theory, just as 
any other behavioral theory, and it needs verification. 
Empirical work is needed to show that latent ability scores 
work as the theory predicts. Simulated examinees will not 
do; studies are needed with real people. Are the scores 
invariant over item selections, or over samples of 
individuals? Does the precision of measurement really work the 
way the information variable says it does? What about the 
relation of validity to test length or information? Empirical 
work has been presented by Waters and others, but whether 
it supports the theory is not clear. 

Classical test theory has a curious status; most 
psychologists and educators believe that it is fact, not theory. 
Nowhere in Lord & Novick's treatise is there a section on 
empirical verification of the theory. Actually, test theory is 
a self-consistent, much-elaborated theory that seems to 
work pretty well. For example, the Spearman-Brown 
formula usually works. Some people look upon the 
Spearman-Brown formula as a fact. It is a fact only in the 
sense that it is a logical consequence of the basic 
assumptions of the theory. So far as I know, neither true score 
theory nor latent trait theory has been put to a critical test, 
as have most other mathematical theories of behavior. 
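
For reference, the Spearman-Brown formula mentioned here predicts the reliability of a test lengthened by a factor k from the reliability r of the original, assuming the added parts are parallel to the existing ones. A minimal sketch follows; the function and argument names are illustrative only.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Holds only under the classical assumption of parallel parts; it is a
    deduction from the theory, not an empirical result about any real test.
    """
    r, k = reliability, length_factor
    return (k * r) / (1.0 + (k - 1.0) * r)
```

For instance, doubling a test whose reliability is 0.5 gives a predicted reliability of 2(0.5)/(1 + 0.5), or two thirds. Whether a real lengthened test attains that figure is the empirical question raised above.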

One final theoretical issue needs clarification. The 
literature contains results (cf. Lord, 1970) indicating that 
a tailored test is not much more effective than an ordinary 
test with a peaked item difficulty distribution. The 
advantage lies mainly in the extremes. But the theoretical and 
empirical results presented in this conference indicate that a 
tailored test is much better even in the mid-range. Work is 
needed to clarify when a tailored test will help and when it 
won't. 

One final point about technical terminology. In the 
simulation studies of Jensema, Waters, McBride, and others, 
the estimated ability $\hat\theta$, which is the test score in tailored 
testing, is supposed to be nearly $\theta$. The closeness of $\hat\theta$ to $\theta$ 
is measured both by $\left[\sum(\hat\theta - \theta)^2 / N\right]^{1/2}$, which was called the 
"standard error", and by $r_{\hat\theta\theta}$, which was called the 
"validity". In engineering, the former measure is commonly 
called the root-mean-square error, or R.M.S. error; it is not, 
after all, a standard error, since it's not a standard 
deviation. Mean square error includes both error variance 
and squared bias. Thus the measure is very appropriate, but 
it is misnamed. To call $r_{\hat\theta\theta}$ the "validity" is much worse; it 
is downright sinful. This use of the term goes back, I'm 
told, to Ledyard Tucker and Hubert Brogden, but that only 
proves that people in high places make mistakes. A 
different word must be used. "Validity" is seriously 
misleading, and has even been misinterpreted at this 
conference. My own candidate for a name for $r_{\hat\theta\theta}$ is 
"fidelity". I hope the in-group either uses "fidelity" or 
finds another word. 
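
The distinction drawn here can be made concrete with a small Python sketch. It assumes simulated data in which the true abilities are known, as in the simulation studies mentioned; the function names and the toy numbers are invented for illustration. It computes the root-mean-square error, splits the errors into bias and error standard deviation, and computes the correlation that the text proposes to call fidelity.

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def rms_error(estimates, truths):
    """Root-mean-square error of the ability estimates."""
    errors = [e - t for e, t in zip(estimates, truths)]
    return math.sqrt(_mean([d * d for d in errors]))

def bias_and_error_sd(estimates, truths):
    """Mean error (bias) and standard deviation of the errors about that mean."""
    errors = [e - t for e, t in zip(estimates, truths)]
    bias = _mean(errors)
    error_sd = math.sqrt(_mean([(d - bias) ** 2 for d in errors]))
    return bias, error_sd

def fidelity(estimates, truths):
    """Pearson correlation between estimated and true ability."""
    me, mt = _mean(estimates), _mean(truths)
    cov = sum((e - me) * (t - mt) for e, t in zip(estimates, truths))
    var_e = sum((e - me) ** 2 for e in estimates)
    var_t = sum((t - mt) ** 2 for t in truths)
    return cov / math.sqrt(var_e * var_t)

# Toy data (hypothetical): every estimate is too high by exactly 0.5.
truths = [-1.0, 0.0, 1.0, 2.0]
estimates = [t + 0.5 for t in truths]
```

In this toy case the errors have no spread at all, so the error standard deviation is zero and the correlation is perfect, yet the root-mean-square error is 0.5 because of the bias. The squared RMS error always equals the squared bias plus the error variance, which is why it is not a standard deviation.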



Next Steps 

Now that tailored testing is about to become 
operational, perhaps it is time to take a longer range perspective. 
Do the present developments really exploit the power of an 
interactive computer? Many scientists, in their first 
encounter with a computer, use the computer mainly to do 
faster and neater what they were already doing before 
computers. It is as if the horse and buggy industry's 
reaction to internal combustion engines had been to build a 
mechanical horse. Statistical computation is a good case in 
point. To a very large extent, statistics is still at the 
mechanical horse stage in its use of computers. The 
statistical program packages are fast ways to do old 
things: analysis of variance, regression, factor analysis. Even 
the few new ideas, such as nonmetric scaling and 
clustering, had their roots in pre-computer ideas. Interactive 
statistical methods are still in their infancy. Mostly, 
interaction means replacing the control cards in an input 
deck by questions printed by the machine and answered by 
the user. No subtle interplay of human 
judgment and computer speed is implied. 

The mechanical horse stage in computerized testing 
would be an automatic test production system. Given the 
characteristics of a population, the computer would select 
the most appropriate items from its item files and would 
print a suitable test. I naively thought testing had avoided 
this typical first stage, but apparently such systems were 
built some years ago. 

Tailored testing is one step beyond the mechanical horse 
stage. To be sure, the up-and-down method had seldom 
been used in mental testing, barring Binet, who didn't do it 
right, but the up-and-down method is an old standby 
in psychophysics, and in sensitivity testing generally, dating 
from World War II and earlier. Also, test theoreticians knew 
that measurement was best when the items were all 
sufficiently difficult that the examinee got about half of 
them correct. (Actually about 68% for 5-alternative items, 
Fred Lord reminds me, because of guessing.) This is one 
part of the theory that none of the operational people 
believed, but the theory was there. So the adaptive test was 
a natural next step in computer involvement in testing. 
Still, the only use of the computer in tailored testing, apart 
from the trivial use in presenting the items on a terminal, is 
in selecting the next item and computing the ability scores. 
The same 5-choice items are being used, the item is scored 
either right or wrong, the same kinds of traits are being 
measured. Now is the time to move on, in research at any 
rate, to better things. 

Many more opportunities exist. Some have been 
mentioned at this conference. Samejima proposes that we use 
the particular wrong choice of an item as partial 
information. Some wrong choices are better than others. Item 
response weighting has minimal utility in standard tests, 
primarily because of the test length. Weighting becomes 
more useful with fewer items, which is just what tailored 
testing provides. In addition to Samejima's proposal, even 
more information could be obtained, when the response is 
wrong, by asking for a second try. The procedure of trying 
alternatives until getting the answer goes back to the 
1940's or earlier. In those days, Science Research 
Associates sold a punch board on which answers were punched 
out. Instructions were to punch out alternatives until the 
red dot appeared, signalling the right choice. The item score 
was the number of unpunched choices, except that omits 
got a negative score. I am told that test scores based on 
these item scores were consistently more reliable and more 
valid than scores based on a 1-0 item scoring. The computer 
terminal is an elegant punch-board! Another possibility is 
to have the examinee rank or rate the alternatives for 
suitability. The probability assignment proposal of Shuford 
et al. (1966), now being tried by Weiss and his coworkers, is 
equivalent, though the restriction that the ratings must add 
to one, like probabilities, is an unfortunate complication 
that is likely to have adverse operational consequences. 
Ratings or rankings would be better. 

The computer permits the use of constructed 
responses (fill in the blanks) rather than multiple choice. 
Computer processing of constructed responses has been 
worked on in computer assisted instruction; these 
techniques could be adapted to the testing situation. Most of 
our present item types have evolved in a multiple choice 
environment, and constructed responses would be no help. 
For example, some verbal analogies items would not work 
as constructed responses, e.g., "Brick is to building as 
leather is to ______." Others would work: "Shoe is to 
foot as helmet is to ______." The difficulty of 
vocabulary items is controlled almost entirely by the distractors, 
so asking the examinee to construct a synonym would 
markedly alter the item. But there is no reason why new 
item types cannot evolve in the new context. Verbal 
fluency is a natural for the computer to test, and virtually 
impossible in the multiple choice context. 

Of more interest is the possibility of new types of 
items, and new types of traits. The GRIP tests of Cory are 
especially interesting, as are some of the items briefly 
mentioned by Weiss, such as his conceptual maze. Many of 
these types can be tried on present day alphanumeric 
terminals; others need graphic terminals, which are at 
present too costly, but which may soon be relatively 
inexpensive. 

I am convinced that the potential for new styles of 
items, or contingent sets of items, is the next important 
contribution of the computer. After all, we already know 
how to measure verbal ability and quantitative ability. The 
computer merely gives us efficiency. What we need is more 
information. 

The computer could also be immensely helpful if we 
placed less emphasis on measurement and more on the 
decision process. Instead of providing a test battery, we 
could provide a decision system. Many years ago Cronbach 
& Gleser (1965) argued for the necessity of coupling the 
decision process with the testing process. The computer, 
and computer assisted testing, have provided an 
unparalleled opportunity to do this. Hansen, McKillip, & Lord have 
mentioned this. 

Consider the simple example of selecting among 
applicants for a particular job or for entry to a particular college. 
The test's job is to label each taker as qualified or not 
qualified. This implies a cut-off score, or at least a cut-off 
region. The very well qualified and the very poorly 
qualified persons can probably be identified relatively 
quickly; most of the effort should be spent on the 
borderline cases. To be sure, we must beware of Lord's 
lucky guesser, and Weiss' low consistency scorer, but with 
care, an efficient system can be devised that does not 
measure accurately at all levels, but only where it counts. 
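
One way such a system might work can be sketched in Python under assumptions not stated in the talk. After each item, an interval is formed around the current ability estimate from its standard error, and testing stops as soon as the interval lies wholly above or below the cut-off. The normal-theory multiplier z = 1.96, the minimum test length (a crude guard against a few lucky guesses), and all names are illustrative choices, not a procedure proposed at the conference.

```python
def classify(theta_hat, std_error, cutoff, n_items, min_items=5, z=1.96):
    """Decide whether to stop testing and how to label the examinee.

    theta_hat  -- current ability estimate
    std_error  -- standard error of that estimate
    cutoff     -- cut-off score on the ability scale
    n_items    -- number of items administered so far
    min_items  -- hypothetical minimum length before any decision is allowed
    z          -- half-width of the interval in standard-error units
    """
    if n_items < min_items:
        return "continue testing"
    if theta_hat - z * std_error > cutoff:
        return "qualified"
    if theta_hat + z * std_error < cutoff:
        return "not qualified"
    return "continue testing"
```

Examinees far from the cut-off are labeled after few items, while borderline cases keep receiving items until the standard error shrinks enough to separate the estimate from the cut-off.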

A one-dimensional case is only the beginning. Both Weiss 
and Hansen have suggested that additional savings can be 
made when there are several relevant dimensions. Here, 
progress requires that the decision process be coupled with 
the testing process to build a complete system. 

There are many different approaches to a personnel 
decision system. One model would treat jobs as regions in a 
space whose dimensions are specific job requirements, 
specific abilities, or characteristics needed for the job. A 
person is a point in this space; the testing problem is to 
pinpoint the person's position sufficiently accurately to be 
able to list the jobs for which he is qualified, and possibly 
to list these in rank order from the ones for which he is 
most qualified to the ones for which he is barely qualified. 
The dimensions of the job space might be abilities, or they 
might not. Individual items might serve to locate a 
person on only one dimension, or items might help to 
locate a person in the total space. At least, there is no a 
priori reason for discarding impure multidimensional items. 
Indeed such items might be especially useful in a decision 
system. 

Five years ago at a similar conference (Green, 1970) I 
said that the computer had a great future in testing. Today, 
happily, it has a present as well as a future. Operational 
versions of tailored tests represent a great technical 
achievement. Furthermore, the computer plays a central role in the 
enterprise. Still, the potential of the computer has barely 
been tapped. The future lies ahead. 
ANNOUNCEMENTS 

Dr. Robert L. Gettelfinger of Educational Testing Service 
announced that organization's willingness to edit a 
newsletter on the subject of computer-assisted testing. He asked 
for suggestions as to the content of the newsletter, and for 
the opinions of the conferees as to what subject matter 
should be covered and as to whether contributions should 
be entirely voluntary or should be obtained by inviting 
papers. 

Dr. David J. Weiss of the University of Minnesota 
announced that he will edit a new journal, Applied 
Psychological Measurement, that will publish empirical 
research on the application of techniques of psychological 
measurement to substantive problems in all areas of 
psychology and related disciplines such as sociology and 
political science. He invited conference participants to 
submit their papers and promised to send further details to 
all participants. 